Datasets
In Hectiq Lab, datasets are accessed through the `Dataset` singleton. The `Dataset` class provides methods to download data files stored in the cloud, to get the paths where those files are saved locally, and to push files to the Lab. Using datasets is a good practice for tracking the data sources used in machine learning experiments.
Datasets are associated with projects and can also be attached to runs. Each dataset is given a type, a data source and a hostname. This makes it possible to define dataset objects that point to local directories on your computer, to S3 buckets, or to the Hectiq Lab cloud services.
Create dataset
To create a dataset, use the `create` method. This method creates an instance of the dataset in the application. It is available from the functional API, the object-oriented API, and the command-line interface.
import pyhectiqlab.functional as hl
hl.create_dataset(name="dataset_name", source="/path/to/the/dataset/")
from pyhectiqlab import Dataset
dataset = Dataset.create(name="dataset_name", source="/path/to/the/dataset/")
hectiq-lab Dataset.create --name "dataset_name" --source "/path/to/the/dataset/" --project hectiq-ai/demo
Name | Type | Default | Description |
---|---|---|---|
name | str | - | Name of the dataset. |
source | str | - | Path to the dataset. If your dataset is located in a local directory, the source should be the directory path (e.g., "/path/to/dataset"). If the dataset is located in a cloud storage, the source should be the URL of the dataset (e.g., "s3://bucket/dataset"). |
host | str | None | Host of the dataset. If the dataset is located in a local directory, the host can be your hostname or left empty. If the dataset is located in a cloud storage, the host should be the cloud storage name ("s3" or "gs"). |
description | str | None | Description of the dataset. |
version | str | None | Version of the dataset. |
type | str | None | Type of the dataset. |
run_id | str | None | ID of the run to attach to the dataset. If None, the dataset is not attached to any run. |
project | str | None | Project of the dataset. |
upload | bool | True | If True, uploads the local dataset to the Lab. |
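The relationship between `source` and `host` described in the table can be sketched as a small helper. This is only an illustration: `infer_host` is a hypothetical function, not part of `pyhectiqlab`, and it recognizes only the two cloud storage names mentioned above.

```python
from typing import Optional
from urllib.parse import urlparse


def infer_host(source: str) -> Optional[str]:
    """Guess a `host` value from a dataset `source` (hypothetical helper)."""
    scheme = urlparse(source).scheme.lower()
    # Only the two cloud storage names listed in the table are recognized.
    if scheme in ("s3", "gs"):
        return scheme
    # Anything else is treated as a local path: leave the host empty.
    return None


print(infer_host("s3://bucket/dataset"))  # s3
print(infer_host("/path/to/dataset"))     # None
```

The returned value could then be passed as the `host` argument when creating the dataset.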
WARNING
If the `name` or `source` parameter is not provided, the project is not found, or the dataset creation fails, the method logs an error and returns `None`.
Upload files to the dataset
To upload files to the dataset, use the `upload` method. This method uploads the files to the dataset in the cloud.
import pyhectiqlab.functional as hl
hl.upload_dataset(id="datasetID", source="/path/to/the/dataset/")
from pyhectiqlab import Dataset
Dataset.upload(id="datasetID", source="/path/to/the/dataset/")
hectiq-lab Dataset.upload --id "datasetID" --source "/path/to/the/dataset/"
Name | Type | Default | Description |
---|---|---|---|
id | str | - | ID of the dataset. |
source | str | - | Source of the dataset. |
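Before uploading, it can help to check locally that the `source` exists and actually contains files. The following is a minimal sketch of such a pre-flight check; `list_upload_files` is a hypothetical helper, and accepting either a single file or a directory is a choice of this sketch, not a documented behavior of the upload method.

```python
from pathlib import Path
from typing import List


def list_upload_files(source: str) -> List[Path]:
    """Return the files under `source`, or raise if there is nothing to upload."""
    root = Path(source)
    if not root.exists():
        raise FileNotFoundError(f"Source does not exist: {source}")
    # This sketch accepts a single file or walks a directory recursively.
    if root.is_file():
        files = [root]
    else:
        files = sorted(p for p in root.rglob("*") if p.is_file())
    if not files:
        raise ValueError(f"Source contains no files: {source}")
    return files
```

Calling `list_upload_files("/path/to/the/dataset/")` before the upload surfaces a wrong path early, instead of relying on the error logged by the method.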
Retrieve a dataset
To retrieve an existing dataset by name and version, use the `Dataset.retrieve` method.
import pyhectiqlab.functional as hl
hl.retrieve_dataset(name="dataset_name", version="0.1.0")
from pyhectiqlab import Dataset
dataset = Dataset.retrieve(name="dataset_name", version="0.1.0")
hectiq-lab Dataset.retrieve --name "dataset_name" --version "0.1.0" --project hectiq-ai/demo
Name | Type | Default | Description |
---|---|---|---|
name | str | - | Name of the dataset |
project | str | - | Project of the dataset |
version | str | - | Version of the dataset |
fields | list[str] | None | Fields to retrieve. |
WARNING
If the project is not found, the method logs an error and returns `None`.
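Because a failed call returns `None` rather than raising, a common pattern is to retrieve a dataset and create it only when nothing comes back. The sketch below takes the two methods as arguments so it stays independent of the library. `get_or_create` is a hypothetical helper, and it assumes that a missing dataset is reported as `None` in the same way as a missing project.

```python
from typing import Any, Callable, Optional


def get_or_create(
    retrieve: Callable[..., Optional[Any]],
    create: Callable[..., Optional[Any]],
    name: str,
    version: str,
    source: str,
) -> Any:
    """Return the existing dataset, creating it when retrieval yields None."""
    dataset = retrieve(name=name, version=version)
    if dataset is not None:
        return dataset
    # Assumption: a missing dataset is reported as None, like a missing project.
    dataset = create(name=name, version=version, source=source)
    if dataset is None:
        raise RuntimeError(f"Could not create dataset {name!r} ({version})")
    return dataset
```

With the object-oriented API, this would be called as `get_or_create(Dataset.retrieve, Dataset.create, name="dataset_name", version="0.1.0", source="/path/to/the/dataset/")`.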
Download a dataset locally
A dataset that has been uploaded to the Hectiq Lab can be downloaded using the `download` method.
import pyhectiqlab.functional as hl
hl.download_dataset(name="dataset-name", version="1.0.0", path="/path/to/the/dataset")
from pyhectiqlab import Dataset
Dataset.download(name="dataset-name", version="1.0.0", path="/path/to/the/dataset")
hectiq-lab Dataset.download --name "dataset-name" --version "1.0.0" --path "/path/to/the/dataset"
Name | Type | Default | Description |
---|---|---|---|
name | str | - | Name of the dataset. |
version | str | - | Version of the dataset. |
project | str, optional | None | Project of the dataset. |
path | str | None | Path to download the dataset. If None, it uses the HECTIQLAB_DATASETS_DOWNLOAD environment variable or the current directory. |
overwrite | bool | False | Whether to overwrite the existing files. |
WARNING
If the project or the dataset is not found, the method logs an error and returns `None`.
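The fallback order documented for `path` (explicit argument, then the `HECTIQLAB_DATASETS_DOWNLOAD` environment variable, then the current directory) can be sketched as follows. `resolve_download_path` is a hypothetical helper that mirrors the description in the table; it is not the library's implementation, and treating an empty variable as unset is an assumption of this sketch.

```python
import os
from typing import Optional


def resolve_download_path(path: Optional[str] = None) -> str:
    """Mirror the documented fallback: explicit path, then env var, then cwd."""
    if path is not None:
        return path
    # An unset or empty variable falls through to the current directory.
    return os.environ.get("HECTIQLAB_DATASETS_DOWNLOAD") or os.getcwd()
```

Setting `HECTIQLAB_DATASETS_DOWNLOAD` once, for example in a shell profile, keeps downloads from different scripts in a single location without passing `path` each time.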
Delete a dataset
To delete a dataset from the repository, use the `delete` method. The dataset can be identified either by its `id` or by its `name` and `version`.
import pyhectiqlab.functional as hl
hl.delete_dataset(name="dataset-name", version="1.0.0")
from pyhectiqlab import Dataset
Dataset.delete(name="dataset-name", version="1.0.0")
hectiq-lab Dataset.delete --name "dataset-name" --version "1.0.0"
Name | Type | Default | Description |
---|---|---|---|
id | str | - | ID of the dataset. |
name | str | None | Name of the dataset. |
version | str | None | Version of the dataset. |
project | str | None | Project of the dataset. |
wait_response | bool | False | Wait for the response from the server. |
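Since deletion accepts two ways of identifying a dataset, it can be worth validating the arguments before calling the method. The sketch below builds the keyword arguments and rejects ambiguous input. `delete_selector` is a hypothetical helper, and refusing a mix of `id` with `name` or `version` is a strictness added by this sketch, not a rule stated by the library.

```python
from typing import Dict, Optional


def delete_selector(
    id: Optional[str] = None,
    name: Optional[str] = None,
    version: Optional[str] = None,
) -> Dict[str, str]:
    """Build keyword arguments that identify exactly one dataset to delete."""
    if id is not None:
        if name is not None or version is not None:
            raise ValueError("Pass either id, or name and version, not both")
        return {"id": id}
    if name is None or version is None:
        raise ValueError("Without an id, both name and version are required")
    return {"name": name, "version": version}
```

The result can be unpacked into the call, for example `Dataset.delete(**delete_selector(name="dataset-name", version="1.0.0"))`.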
Update a dataset
The `update` method updates the properties of a dataset given its ID.
DANGER
The name and version of a dataset can be updated. The update uses the `id` to find the dataset to update.
import pyhectiqlab.functional as hl
hl.update(id="dataset-id", name="new_dataset_name")
from pyhectiqlab import Dataset
Dataset.update(id="dataset-id", name="new_dataset_name")
hectiq-lab Dataset.update --id "dataset-id" --name "new_dataset_name"
Name | Type | Default | Description |
---|---|---|---|
id | str | - | ID of the dataset. |
name | str, optional | - | Name of the dataset. |
description | str, optional | - | Description of the dataset. |
version | str, optional | - | Version of the dataset. |
block | str, optional | - | Block of the dataset. |
wait_response | bool | False | Wait for the response from the server. |
List datasets
To list the datasets, use the `list` method.
import pyhectiqlab.functional as hl
hl.list(project="hectiq-ai/demo")
from pyhectiqlab import Dataset
Dataset.list(project="hectiq-ai/demo")
hectiq-lab Dataset.list --project "hectiq-ai/demo"
Name | Type | Default | Description |
---|---|---|---|
project | str | - | Project of the dataset. |
search | str | - | Search string. |
author | str | - | Author of the dataset. |
keep_latest_version | bool | False | If True, only returns the latest version of each dataset, grouped by dataset name. |
fields | list[str] | - | Fields to retrieve. |
page | int | - | Page number. |
limit | int | - | Limit of the datasets. |
order_by | str | - | Order by. |
order_direction | str | - | Order direction. |
wait_response | bool | False | Wait for the response from the server. |
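The `page` and `limit` parameters suggest that results are paginated. A minimal sketch of walking through all pages is given below. `iter_datasets` is a hypothetical helper, and it rests on several assumptions that this page does not confirm: that the listing call returns a plain list of records for one page, that pages start at 1, and that a page shorter than `limit` is the last one.

```python
from typing import Any, Callable, Iterator, List


def iter_datasets(
    list_page: Callable[..., List[Any]],
    limit: int = 50,
    first_page: int = 1,
    **filters: Any,
) -> Iterator[Any]:
    """Yield datasets page by page until a short or empty page is returned."""
    page = first_page
    while True:
        # Assumption: one call returns a list of records; None is treated as empty.
        results = list_page(page=page, limit=limit, **filters) or []
        yield from results
        # A page shorter than `limit` is taken to be the last one.
        if len(results) < limit:
            return
        page += 1
```

If those assumptions hold, `iter_datasets(Dataset.list, limit=50, project="hectiq-ai/demo")` would iterate over every dataset of the project. If the call returns a different structure, the loop body needs to be adapted accordingly.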
Attach / Detach datasets to a run
A dataset can be attached to or detached from a run. To do so, use the `attach` and `detach` methods.
import pyhectiqlab.functional as hl
hl.attach_dataset(name="dataset-name", version="1.0.0", run_id="1wekv90")
hl.detach_dataset(name="dataset-name", version="1.0.0", run_id="1wekv90")
from pyhectiqlab import Dataset
Dataset.attach(name="dataset-name", version="1.0.0", run_id="1wekv90")
Dataset.detach(name="dataset-name", version="1.0.0", run_id="1wekv90")
hectiq-lab Dataset.attach --name "dataset-name" --version "1.0.0" --run_id "1wekv90"
hectiq-lab Dataset.detach --name "dataset-name" --version "1.0.0" --run_id "1wekv90"
Name | Type | Default | Description |
---|---|---|---|
name | str | - | Name of the dataset. |
version | str | - | Version of the dataset. |
run_id | str | None | ID of the run. |
project | str | None | Project of the dataset. |
wait_response | bool | False | Wait for the response from the server. |
WARNING
If the `run_id` parameter is not provided or the dataset is not found, the method logs an error and returns `None`.
Attach / detach tags to the dataset
As with runs, tags can be attached to or detached from datasets using the `add_tags` and `detach_tag` methods.
import pyhectiqlab.functional as hl
hl.add_tags_to_dataset(name="dataset-name", version="1.0.0", tags=["some", "tag"])
hl.detach_tag_from_dataset(name="dataset-name", version="1.0.0", tag="some")
from pyhectiqlab import Dataset
Dataset.add_tags(name="dataset-name", version="1.0.0", tags=["some", "tag"])
Dataset.detach_tag(name="dataset-name", version="1.0.0", tag="some")
hectiq-lab Dataset.add_tags --name "dataset-name" --version "1.0.0" --tags "some" --tags "tag"
hectiq-lab Dataset.detach_tag --name "dataset-name" --version "1.0.0" --tag "some"
WARNING
If the dataset is not found, the method logs an error and returns `None`.