Datasets
In Hectiq Lab, datasets are accessed through the Dataset singleton. The Dataset class provides methods to download data files stored in the cloud, to get the paths where those files are saved locally, and to push files to the Lab. Using datasets is good practice for tracking the data sources used in machine learning experiments.
Datasets are associated with projects and can also be attached to runs. Each dataset is given a type, a data source and a hostname. This makes it possible to define dataset objects that point to local directories on your computer, to S3 buckets or to the Hectiq Lab cloud services.
Create a dataset
To create a dataset, use the create method, which registers a new dataset in the application. It is available from the functional API, the object-oriented API, and the command-line interface.
import pyhectiqlab.functional as hl
hl.create_dataset(name="dataset_name", source="/path/to/the/dataset/")

from pyhectiqlab import Dataset
dataset = Dataset.create(name="dataset_name", source="/path/to/the/dataset/")

hectiq-lab Dataset.create --name "dataset_name" --source "/path/to/the/dataset/" --project hectiq-ai/demo

| Name | Type | Default | Description |
|---|---|---|---|
| name | str | - | Name of the dataset. |
| source | str | - | Path to the dataset. If the dataset is located in a local directory, the source should be the directory path (e.g., "/path/to/dataset"). If the dataset is located in cloud storage, the source should be the URL of the dataset (e.g., "s3://bucket/dataset"). |
| host | str | None | Host of the dataset. If the dataset is located in a local directory, the host can be your hostname or left empty. If the dataset is located in cloud storage, the host should be the cloud storage name ("s3" or "gs"). |
| description | str | None | Description of the dataset. |
| version | str | None | Version of the dataset. |
| type | str | None | Type of the dataset. |
| run_id | str | None | ID of the run to attach to the dataset. If None, the dataset is not attached. |
| project | str | None | Project of the dataset. |
| upload | bool | True | If True, uploads the local dataset to the Lab. |
WARNING
If the name or source parameter is not provided, the project is not found, or the dataset creation fails, the method logs an error and returns None.
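The parameter table ties the host value to where the source lives: "s3" or "gs" for cloud storage, and empty (or your hostname) for a local directory. A minimal sketch of that rule follows. The helper name `infer_host` is hypothetical and not part of pyhectiqlab; it only looks at the URL scheme of the source.

```python
from typing import Optional
from urllib.parse import urlparse


def infer_host(source: str) -> Optional[str]:
    """Hypothetical helper (not part of pyhectiqlab).

    Returns "s3" or "gs" when the source is a cloud storage URL,
    and None for anything else, which is treated as a local path.
    """
    scheme = urlparse(source).scheme.lower()
    if scheme in ("s3", "gs"):
        return scheme
    # Local directory: the host may be left empty.
    return None


# Possible use with the documented parameters (commented out because it
# needs pyhectiqlab and a configured project):
# hl.create_dataset(name="dataset_name", source=src, host=infer_host(src))
```

For a local dataset you may still prefer to pass your machine's hostname explicitly instead of None, so that the origin of the files is recorded.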
Upload files to a dataset
To upload files to a dataset, use the upload method, which uploads the files to the dataset in the cloud.
import pyhectiqlab.functional as hl
hl.upload_dataset(id="datasetID", source="/path/to/the/dataset/")

from pyhectiqlab import Dataset
Dataset.upload(id="datasetID", source="/path/to/the/dataset/")

hectiq-lab Dataset.upload --id "datasetID" --source "/path/to/the/dataset/"

| Name | Type | Default | Description |
|---|---|---|---|
| id | str | - | ID of the dataset. |
| source | str | - | Source of the dataset. |
Retrieve a dataset
To retrieve an existing dataset by name and version, use the Dataset.retrieve method.
import pyhectiqlab.functional as hl
hl.retrieve_dataset(name="dataset_name", version="0.1.0")

from pyhectiqlab import Dataset
dataset = Dataset.retrieve(name="dataset_name", version="0.1.0")

hectiq-lab Dataset.retrieve --name "dataset_name" --version "0.1.0" --project hectiq-ai/demo

| Name | Type | Default | Description |
|---|---|---|---|
| name | str | - | Name of the dataset. |
| project | str | - | Project of the dataset. |
| version | str | - | Version of the dataset. |
| fields | list[str] | None | Fields to retrieve. |
WARNING
If the project is not found, it logs an error and returns None.
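Because a failed retrieval returns None instead of raising, a script can continue with a missing dataset unless the result is checked. One way to fail early is sketched below. The `require` helper is hypothetical and not part of pyhectiqlab; it only wraps the None check.

```python
from typing import Optional, TypeVar

T = TypeVar("T")


def require(result: Optional[T], what: str) -> T:
    """Hypothetical guard (not part of pyhectiqlab).

    Raises LookupError when a call returned None, otherwise
    passes the result through unchanged.
    """
    if result is None:
        raise LookupError(f"{what} could not be retrieved; see the logged error")
    return result


# Possible use with the documented call (commented out because it needs
# pyhectiqlab and a configured project):
# dataset = require(
#     Dataset.retrieve(name="dataset_name", version="0.1.0"),
#     "dataset_name 0.1.0",
# )
```

The same guard applies to the other methods on this page that log an error and return None.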
Download a dataset locally
A dataset that has been uploaded to Hectiq Lab can be downloaded using the download method.
import pyhectiqlab.functional as hl
hl.download_dataset(name="dataset-name", version="1.0.0", path="/path/to/the/dataset")

from pyhectiqlab import Dataset
Dataset.download(name="dataset-name", version="1.0.0", path="/path/to/the/dataset")

hectiq-lab Dataset.download --name "dataset-name" --version "1.0.0" --path "/path/to/the/dataset"

| Name | Type | Default | Description |
|---|---|---|---|
| name | str | - | Name of the dataset. |
| version | str | - | Version of the dataset. |
| project | str, optional | None | Project of the dataset. |
| path | str | None | Path to download the dataset. If None, it uses the HECTIQLAB_DATASETS_DOWNLOAD environment variable or the current directory. |
| overwrite | bool | False | Whether to overwrite the existing files. |
WARNING
If the project is not found, or if the dataset is not found, it logs an error and returns None.
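The fallback order described for the path parameter (explicit argument, then the HECTIQLAB_DATASETS_DOWNLOAD environment variable, then the current directory) can be sketched as follows. This is an illustration of the documented behavior, not the library's actual code; the function name `resolve_download_path` is hypothetical, and details such as how an empty variable is treated are assumptions.

```python
import os
from typing import Optional


def resolve_download_path(path: Optional[str] = None) -> str:
    """Hypothetical illustration of the documented fallback order."""
    if path is not None:
        # 1. An explicit path always wins.
        return path
    # 2. Otherwise use the environment variable if it is set,
    # 3. and fall back to the current working directory.
    return os.environ.get("HECTIQLAB_DATASETS_DOWNLOAD", os.getcwd())
```

Setting the environment variable once, for example in a shell profile, keeps downloads from different scripts in a single directory without passing path on every call.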
Delete a dataset
To delete a dataset from the repository, use the delete method. The dataset can be identified either by its id or by its name and version.
import pyhectiqlab.functional as hl
hl.delete_dataset(name="dataset-name", version="1.0.0")

from pyhectiqlab import Dataset
Dataset.delete(name="dataset-name", version="1.0.0")

hectiq-lab Dataset.delete --name "dataset-name" --version "1.0.0"

| Name | Type | Default | Description |
|---|---|---|---|
| id | str | - | ID of the dataset. |
| name | str | None | Name of the dataset. |
| version | str | None | Version of the dataset. |
| project | str | None | Project of the dataset. |
| wait_response | bool | False | Wait for the response from the server. |
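
Since a dataset is identified either by its id or by its name and version, it can help to validate the identifier before calling delete. The sketch below shows one way to do that. The `dataset_selector` helper is hypothetical and not part of pyhectiqlab; giving precedence to id when both forms are passed is an assumption.

```python
from typing import Dict, Optional


def dataset_selector(
    id: Optional[str] = None,
    name: Optional[str] = None,
    version: Optional[str] = None,
) -> Dict[str, str]:
    """Hypothetical helper (not part of pyhectiqlab).

    Builds the keyword arguments that identify a dataset and rejects
    incomplete identifiers such as a name without a version.
    """
    if id is not None:
        # Assumption: the id takes precedence when both forms are given.
        return {"id": id}
    if name is not None and version is not None:
        return {"name": name, "version": version}
    raise ValueError("provide either id, or both name and version")


# Possible use with the documented call (commented out because it needs
# pyhectiqlab and a configured project):
# Dataset.delete(**dataset_selector(name="dataset-name", version="1.0.0"))
```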
Update a dataset
The update method updates the properties of a dataset given its ID.
DANGER
The name and version of a dataset can be updated. The update uses the id to find the dataset to update.
import pyhectiqlab.functional as hl
hl.update(id="dataset-id", name="new_dataset_name")

from pyhectiqlab import Dataset
Dataset.update(id="dataset-id", name="new_dataset_name")

hectiq-lab Dataset.update --id "dataset-id" --name "new_dataset_name"

| Name | Type | Default | Description |
|---|---|---|---|
| id | str | - | ID of the dataset. |
| name | str, optional | - | Name of the dataset. |
| description | str, optional | - | Description of the dataset. |
| version | str, optional | - | Version of the dataset. |
| block | str, optional | - | Block of the dataset. |
| wait_response | bool | False | Wait for the response from the server. |
List datasets
To list the datasets, use the list method.
import pyhectiqlab.functional as hl
hl.list(project="hectiq-ai/demo")

from pyhectiqlab import Dataset
Dataset.list(project="hectiq-ai/demo")

hectiq-lab Dataset.list --project "hectiq-ai/demo"

| Name | Type | Default | Description |
|---|---|---|---|
| project | str | - | Project of the dataset. |
| search | str | - | Search string. |
| author | str | - | Author of the dataset. |
| keep_latest_version | bool | False | If True, only returns the latest version of each dataset, grouped by dataset name. |
| fields | list[str] | - | Fields to retrieve. |
| page | int | - | Page number. |
| limit | int | - | Limit of the datasets. |
| order_by | str | - | Order by. |
| order_direction | str | - | Order direction. |
| wait_response | bool | False | Wait for the response from the server. |
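
To make the effect of keep_latest_version concrete, the sketch below reproduces the grouping on the client side. It is an illustration only: the record shape (dictionaries with "name" and "version" keys) and the dotted-integer version format are assumptions, the helper names are hypothetical, and the server may compare versions differently.

```python
from typing import Dict, Iterable, List, Tuple


def version_key(version: str) -> Tuple[int, ...]:
    """Turn a dotted version such as "0.10.0" into (0, 10, 0) for comparison."""
    return tuple(int(part) for part in version.split("."))


def keep_latest(datasets: Iterable[Dict[str, str]]) -> List[Dict[str, str]]:
    """Keep only the highest version of each dataset name (hypothetical helper)."""
    latest: Dict[str, Dict[str, str]] = {}
    for ds in datasets:
        current = latest.get(ds["name"])
        if current is None or version_key(ds["version"]) > version_key(current["version"]):
            latest[ds["name"]] = ds
    return list(latest.values())


records = [
    {"name": "images", "version": "0.1.0"},
    {"name": "images", "version": "0.10.0"},
    {"name": "labels", "version": "1.0.0"},
    {"name": "images", "version": "0.2.0"},
]
latest_only = keep_latest(records)
```

Comparing versions as integer tuples rather than as strings matters here: as strings, "0.2.0" would sort above "0.10.0".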
Attach / detach datasets to a run
A dataset can be attached to or detached from a run. To do so, use the attach and detach methods.
import pyhectiqlab.functional as hl
hl.attach_dataset(name="dataset-name", version="1.0.0", run_id="1wekv90")
hl.detach_dataset(name="dataset-name", version="1.0.0", run_id="1wekv90")

from pyhectiqlab import Dataset
Dataset.attach(name="dataset-name", version="1.0.0", run_id="1wekv90")
Dataset.detach(name="dataset-name", version="1.0.0", run_id="1wekv90")

hectiq-lab Dataset.attach --name "dataset-name" --version "1.0.0" --run_id "1wekv90"
hectiq-lab Dataset.detach --name "dataset-name" --version "1.0.0" --run_id "1wekv90"

| Name | Type | Default | Description |
|---|---|---|---|
| name | str | - | Name of the dataset. |
| version | str | - | Version of the dataset. |
| run_id | str | None | ID of the run. |
| project | str | None | Project of the dataset. |
| wait_response | bool | False | Wait for the response from the server. |
WARNING
If the run_id parameter is not provided or the dataset is not found, it logs an error and returns None.
Attach / detach tags to a dataset
As with runs, tags can be attached to or detached from datasets using the add_tags and detach_tag methods.
import pyhectiqlab.functional as hl
hl.add_tags_to_dataset(name="dataset-name", version="1.0.0", tags=["some", "tag"])
hl.detach_tag_from_dataset(name="dataset-name", version="1.0.0", tag="some")

from pyhectiqlab import Dataset
Dataset.add_tags(name="dataset-name", version="1.0.0", tags=["some", "tag"])
Dataset.detach_tag(name="dataset-name", version="1.0.0", tag="some")

hectiq-lab Dataset.add_tags --name "dataset-name" --version "1.0.0" --tags "some" --tags "tag"
hectiq-lab Dataset.detach_tag --name "dataset-name" --version "1.0.0" --tag "some"

WARNING
If the dataset is not found, it logs an error and returns None.