Datasets

In Hectiq Lab, datasets are accessed through the Dataset singleton. The Dataset class provides methods to download dataset files from the cloud, to retrieve the paths where those files are saved locally, and to push files to the Lab. Using datasets is good practice for tracking the data sources used in machine learning experiments.

Datasets are associated with projects and can also be attached to runs. Each dataset is given a type, a data source, and a hostname. This makes it possible to define dataset objects that point to local directories on your computer, to S3 buckets, or to the Hectiq Lab cloud services.

Create dataset

To create a dataset, use the create method. This method creates a dataset in the application. It is available from the functional API, the object-oriented API, and the command-line interface.

Functional | Object-Oriented | CLI

python

import pyhectiqlab.functional as hl
hl.create_dataset(name="dataset_name", source="/path/to/the/dataset/")

python

from pyhectiqlab import Dataset
dataset = Dataset.create(name="dataset_name", source="/path/to/the/dataset/")

bash

hectiq-lab Dataset.create --name "dataset_name" --source "/path/to/the/dataset/" --project hectiq-ai/demo

Name	Type	Default	Description
`name`	`str`	-	Name of the dataset.
`source`	`str`	-	Path to the dataset. If your dataset is located in a local directory, the source should be the directory path (e.g., "/path/to/dataset"). If the dataset is located in cloud storage, the source should be the URL of the dataset (e.g., "s3://bucket/dataset").
`host`	`str`	`None`	Host of the dataset. If the dataset is located in a local directory, the host can be your hostname or left empty. If the dataset is located in cloud storage, the host should be the cloud storage name ("s3" or "gs").
`description`	`str`	`None`	Description of the dataset.
`version`	`str`	`None`	Version of the dataset.
`type`	`str`	`None`	Type of the dataset.
`run_id`	`str`	`None`	ID of the run to attach to the dataset. If None, the dataset is not attached to a run.
`project`	`str`	`None`	Project of the dataset.
`upload`	`bool`	`True`	If True, uploads the local dataset to the Lab.
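
The `source` and `host` convention from the table above can be sketched as a small helper. This is only an illustration: `infer_host` and `CLOUD_SCHEMES` are hypothetical names that are not part of pyhectiqlab, and falling back to the machine hostname for local paths is one of the two options the table allows (the other is leaving `host` empty).

```python
import socket
from typing import Optional
from urllib.parse import urlparse

# Hypothetical helper, not part of pyhectiqlab. It only mirrors the
# convention described in the parameter table.
CLOUD_SCHEMES = {"s3", "gs"}


def infer_host(source: str) -> Optional[str]:
    """Return a plausible `host` value for a dataset `source`."""
    scheme = urlparse(source).scheme
    if scheme in CLOUD_SCHEMES:
        # Cloud storage: the host is the storage name ("s3" or "gs").
        return scheme
    # Local directory: use this machine's hostname (None would also be valid).
    return socket.gethostname()


cloud_host = infer_host("s3://bucket/dataset")
local_host = infer_host("/path/to/dataset")
```

The returned value could then be passed as the `host` argument when creating the dataset.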

WARNING

If the name or source parameter is not provided, the project is not found, or the dataset creation fails, the method logs an error and returns None.
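
Because a failed creation returns None instead of raising, a script may want to fail early. The wrapper below is a minimal sketch of that idea. `create_or_raise` and `fake_create` are hypothetical names introduced here; `fake_create` is only a stand-in that imitates the documented behaviour so the example runs without the library.

```python
from typing import Any, Callable, Optional


def create_or_raise(create: Callable[..., Optional[Any]], **kwargs: Any) -> Any:
    """Call a creation function and raise if it returns None."""
    result = create(**kwargs)
    if result is None:
        raise RuntimeError(f"Dataset creation failed for arguments: {kwargs}")
    return result


def fake_create(name: Optional[str] = None, source: Optional[str] = None, **_: Any):
    """Stand-in for the real create call: returns None when name or source is missing."""
    if not name or not source:
        return None
    return {"name": name, "source": source}


dataset = create_or_raise(fake_create, name="dataset_name", source="/path/to/the/dataset/")
```

With the library installed, `Dataset.create` or `hl.create_dataset` would presumably be passed in place of `fake_create`.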

Upload files to the dataset

To upload files to the dataset, use the upload method. This method will upload the files to the dataset in the cloud.

Functional | Object-Oriented | CLI

python

import pyhectiqlab.functional as hl
hl.upload_dataset(id="datasetID", source="/path/to/the/dataset/")

python

from pyhectiqlab import Dataset
Dataset.upload(id="datasetID", source="/path/to/the/dataset/")

bash

hectiq-lab Dataset.upload --id "datasetID" --source "/path/to/the/dataset/"

Name	Type	Default	Description
`id`	`str`	-	ID of the dataset.
`source`	`str`	-	Source of the dataset.

Retrieve a dataset

To retrieve an existing dataset by name and version, use the Dataset.retrieve method.

Functional | Object-Oriented | CLI

python

import pyhectiqlab.functional as hl
hl.retrieve_dataset(name="dataset_name", version="0.1.0")

python

from pyhectiqlab import Dataset
dataset = Dataset.retrieve(name="dataset_name", version="0.1.0")

bash

hectiq-lab Dataset.retrieve --name "dataset_name" --version "0.1.0" --project hectiq-ai/demo

Name	Type	Default	Description
`name`	`str`	-	Name of the dataset.
`project`	`str`	-	Project of the dataset.
`version`	`str`	-	Version of the dataset.
`fields`	`list[str]`	`None`	Fields to retrieve.

WARNING

If the project is not found, it logs an error and returns None.

Download a dataset locally

A dataset that has been uploaded to the Hectiq Lab can be downloaded using the download method.

Functional | Object-Oriented | CLI

python

import pyhectiqlab.functional as hl
hl.download_dataset(name="dataset-name", version="1.0.0", path="/path/to/the/dataset")

python

from pyhectiqlab import Dataset
Dataset.download(name="dataset-name", version="1.0.0", path="/path/to/the/dataset")

bash

hectiq-lab Dataset.download --name "dataset-name" --version "1.0.0" --path "/path/to/the/dataset"

Name	Type	Default	Description
`name`	`str`	-	Name of the dataset.
`version`	`str`	-	Version of the dataset.
`project`	`str`, optional	`None`	Project of the dataset.
`path`	`str`	`None`	Path to download the dataset. If None, it uses the `HECTIQLAB_DATASETS_DOWNLOAD` environment variable or the current directory.
`overwrite`	`bool`	`False`	Whether to overwrite the existing files.
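
The fallback described for `path` can be sketched as follows. `resolve_download_path` is a hypothetical helper, not a library function, and the precedence (explicit path, then the `HECTIQLAB_DATASETS_DOWNLOAD` environment variable, then the current directory) is an assumption based on the order in which the table lists them.

```python
import os
from typing import Optional


def resolve_download_path(path: Optional[str] = None) -> str:
    """Pick the download directory following the documented fallback."""
    if path is not None:
        return path
    # Assumed precedence: environment variable first, then the current directory.
    return os.environ.get("HECTIQLAB_DATASETS_DOWNLOAD") or os.getcwd()


explicit = resolve_download_path("/path/to/the/dataset")
fallback = resolve_download_path()
```

Setting `HECTIQLAB_DATASETS_DOWNLOAD` once in the shell environment would then give every download call without a `path` the same destination.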

WARNING

If the project is not found, or if the dataset is not found, it logs an error and returns None.

Delete a dataset

To delete a dataset from the repository, use the delete method. The dataset can be identified either by its id or by its name and version.

Functional | Object-Oriented | CLI

python

import pyhectiqlab.functional as hl
hl.delete_dataset(name="dataset-name", version="1.0.0")

python

from pyhectiqlab import Dataset
Dataset.delete(name="dataset-name", version="1.0.0")

bash

hectiq-lab Dataset.delete --name "dataset-name" --version "1.0.0"

Name	Type	Default	Description
`id`	`str`	-	ID of the dataset.
`name`	`str`	`None`	Name of the dataset.
`version`	`str`	`None`	Version of the dataset.
`project`	`str`	`None`	Project of the dataset.
`wait_response`	`bool`	`False`	Wait for the response from the server.
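
Since a dataset is identified either by its id or by its name and version, a caller may want to validate that choice before deleting anything. The sketch below shows one way to do so. `delete_selector` is a hypothetical helper, and preferring the id when both forms are given is an assumption.

```python
from typing import Dict, Optional


def delete_selector(
    dataset_id: Optional[str] = None,
    name: Optional[str] = None,
    version: Optional[str] = None,
) -> Dict[str, str]:
    """Build the keyword arguments that identify the dataset to delete."""
    if dataset_id is not None:
        # Assumption: the id takes precedence when both forms are supplied.
        return {"id": dataset_id}
    if name is not None and version is not None:
        return {"name": name, "version": version}
    raise ValueError("Provide either an id, or both a name and a version.")


by_name = delete_selector(name="dataset-name", version="1.0.0")
by_id = delete_selector(dataset_id="datasetID")
```

The resulting dictionary could then be unpacked into the call, for example `Dataset.delete(**by_name)`.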

Update a dataset

The update method updates the properties of a dataset given its ID.

DANGER

The name and version of a dataset can be updated. The update uses the id to find the dataset to update.

Functional | Object-Oriented | CLI

python

import pyhectiqlab.functional as hl
hl.update(id="dataset-id", name="new_dataset_name")

python

from pyhectiqlab import Dataset
Dataset.update(id="dataset-id", name="new_dataset_name")

bash

hectiq-lab Dataset.update --id "dataset-id" --name "new_dataset_name"

Name	Type	Default	Description
`id`	`str`	-	ID of the dataset.
`name`	`str, optional`	-	Name of the dataset.
`description`	`str, optional`	-	Description of the dataset.
`version`	`str, optional`	-	Version of the dataset.
`block`	`str, optional`	-	Block of the dataset.
`wait_response`	`bool`	`False`	Wait for the response from the server.

List datasets

To list the datasets, use the list method.

Functional | Object-Oriented | CLI

python

import pyhectiqlab.functional as hl
hl.list(project="hectiq-ai/demo")

python

from pyhectiqlab import Dataset
Dataset.list(project="hectiq-ai/demo")

bash

hectiq-lab Dataset.list --project "hectiq-ai/demo"

Name	Type	Default	Description
`project`	`str`	-	Project of the dataset.
`search`	`str`	-	Search string.
`author`	`str`	-	Author of the dataset.
`keep_latest_version`	`bool`	`False`	If `True`, only returns the latest version of each dataset, grouped by dataset name.
`fields`	`list[str]`	-	Fields to retrieve.
`page`	`int`	-	Page number.
`limit`	`int`	-	Limit of the datasets.
`order_by`	`str`	-	Order by.
`order_direction`	`str`	-	Order direction.
`wait_response`	`bool`	`False`	Wait for the response from the server.
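
To make the effect of `keep_latest_version` concrete, the sketch below applies the same kind of filtering to a local list of records. It illustrates the idea and is not the server's implementation: the record shape (dicts with `name` and `version` keys), the sample dataset names, and the rule that "latest" means the highest dotted numeric version are all assumptions.

```python
from typing import Dict, List, Tuple


def version_key(version: str) -> Tuple[int, ...]:
    """Turn a dotted numeric version such as "1.10.0" into a comparable tuple."""
    return tuple(int(part) for part in version.split("."))


def keep_latest(datasets: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Keep only the highest version of each dataset name."""
    latest: Dict[str, Dict[str, str]] = {}
    for ds in datasets:
        current = latest.get(ds["name"])
        if current is None or version_key(ds["version"]) > version_key(current["version"]):
            latest[ds["name"]] = ds
    return list(latest.values())


# Made-up sample records, for illustration only.
datasets = [
    {"name": "dataset-a", "version": "1.0.0"},
    {"name": "dataset-a", "version": "1.10.0"},
    {"name": "dataset-a", "version": "1.2.0"},
    {"name": "dataset-b", "version": "0.1.0"},
]
latest_only = keep_latest(datasets)
```

Comparing versions as integer tuples rather than strings matters here: as plain strings, "1.2.0" would sort after "1.10.0".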

Attach / Detach datasets to a run

A dataset can be attached to or detached from a run. To do so, use the attach and detach methods.

Functional | Object-Oriented | CLI

python

import pyhectiqlab.functional as hl
hl.attach_dataset(name="dataset-name", version="1.0.0", run_id="1wekv90")
hl.detach_dataset(name="dataset-name", version="1.0.0", run_id="1wekv90")

python

from pyhectiqlab import Dataset
Dataset.attach(name="dataset-name", version="1.0.0", run_id="1wekv90")
Dataset.detach(name="dataset-name", version="1.0.0", run_id="1wekv90")

bash

hectiq-lab Dataset.attach --name "dataset-name" --version "1.0.0" --run_id "1wekv90"
hectiq-lab Dataset.detach --name "dataset-name" --version "1.0.0" --run_id "1wekv90"

Name	Type	Default	Description
`name`	`str`	-	Name of the dataset.
`version`	`str`	-	Version of the dataset.
`run_id`	`str`	`None`	ID of the run.
`project`	`str`	`None`	Project of the dataset.
`wait_response`	`bool`	`False`	Wait for the response from the server.

WARNING

If the run_id parameter is not provided or the dataset is not found, it logs an error and returns None.

Attach / detach tags to the dataset

As with runs, tags can be attached to or detached from datasets using the add_tags and detach_tag methods.

Functional | Object-Oriented | CLI

python

import pyhectiqlab.functional as hl
hl.add_tags_to_dataset(name="dataset-name", version="1.0.0", tags=["some", "tag"])
hl.detach_tag_from_dataset(name="dataset-name", version="1.0.0", tag="some")

python

from pyhectiqlab import Dataset
Dataset.add_tags(name="dataset-name", version="1.0.0", tags=["some", "tag"])
Dataset.detach_tag(name="dataset-name", version="1.0.0", tag="some")

bash

hectiq-lab Dataset.add_tags --name "dataset-name" --version "1.0.0" --tags "some" --tags "tag"
hectiq-lab Dataset.detach_tag --name "dataset-name" --version "1.0.0" --tag "some"

WARNING

If the dataset is not found, it logs an error and returns None.
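
Because add_tags takes a list while detach_tag takes a single tag, bringing a dataset to a desired tag set takes one add call plus one detach call per removed tag. The sketch below computes both sides of that plan. `plan_tag_changes` is a hypothetical helper, and how the current tags are obtained is left open because it is not covered here.

```python
from typing import Iterable, List, Tuple


def plan_tag_changes(
    current: Iterable[str], desired: Iterable[str]
) -> Tuple[List[str], List[str]]:
    """Return (tags to add, tags to detach) to go from `current` to `desired`."""
    current_set, desired_set = set(current), set(desired)
    to_add = sorted(desired_set - current_set)
    to_detach = sorted(current_set - desired_set)
    return to_add, to_detach


to_add, to_detach = plan_tag_changes(
    current=["some", "tag"], desired=["tag", "validated"]
)
```

In this example `to_add` would be passed as the `tags` argument of a single add_tags call, and each entry of `to_detach` would be passed as `tag` to its own detach_tag call.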
