Datasets

In Hectiq Lab, datasets are accessed through the Dataset singleton. The Dataset class provides methods to download data files stored in the cloud, to get the paths where those files are saved locally, and to push files to the Lab. Using datasets is good practice for tracking the data sources used in machine learning experiments.

Datasets are associated with projects and can also be attached to runs. Each dataset has a type, a data source, and a host. This makes it possible to define dataset objects that point to local directories on your computer, to S3 buckets, or to the Hectiq Lab cloud services.

Create a dataset

To create a dataset, use the create method, which registers a new dataset in the application. It is available from the functional API, the object-oriented API, and the command-line interface.

```python
# Functional API
import pyhectiqlab.functional as hl
hl.create_dataset(name="dataset_name", source="/path/to/the/dataset/")
```

```python
# Object-oriented API
from pyhectiqlab import Dataset
dataset = Dataset.create(name="dataset_name", source="/path/to/the/dataset/")
```

```bash
# Command-line interface
hectiq-lab Dataset.create --name "dataset_name" --source "/path/to/the/dataset/" --project hectiq-ai/demo
```

| Name | Type | Default | Description |
| --- | --- | --- | --- |
| name | str | - | Name of the dataset. |
| source | str | - | Path to the dataset. If the dataset is located in a local directory, the source should be the directory path (e.g., "/path/to/dataset"). If the dataset is located in cloud storage, the source should be the URL of the dataset (e.g., "s3://bucket/dataset"). |
| host | str | None | Host of the dataset. If the dataset is located in a local directory, the host can be your hostname or left empty. If the dataset is located in cloud storage, the host should be the cloud storage name ("s3" or "gs"). |
| description | str | None | Description of the dataset. |
| version | str | None | Version of the dataset. |
| type | str | None | Type of the dataset. |
| run_id | str | None | ID of the run to attach to the dataset. If None, the dataset is not attached. |
| project | str | None | Project of the dataset. |
| upload | bool | True | If True, uploads the local dataset to the Lab. |
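
The relationship between source and host described in the table can be sketched as a small helper. This is only an illustration: `infer_host` is a hypothetical function, not part of pyhectiqlab, and it assumes that the scheme of the source URL ("s3" or "gs") is enough to pick the host, while local paths fall back to the machine hostname.

```python
import socket
from typing import Optional
from urllib.parse import urlparse

# Hypothetical helper (not part of pyhectiqlab): maps a dataset source
# to a plausible value for the `host` parameter.
_CLOUD_HOSTS = {"s3": "s3", "gs": "gs"}


def infer_host(source: str) -> Optional[str]:
    """Return "s3" or "gs" for cloud URLs and the hostname for local paths."""
    scheme = urlparse(source).scheme.lower()
    if scheme in _CLOUD_HOSTS:
        return _CLOUD_HOSTS[scheme]
    # An empty scheme is a plain path; a one-letter scheme is a Windows
    # drive letter such as "C:"; "file" is an explicit local URL.
    if len(scheme) <= 1 or scheme == "file":
        return socket.gethostname()
    # Unknown scheme: let the caller decide what to pass as host.
    return None
```

The returned value could then be passed as the host argument of the create method, or omitted when it is None.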

WARNING

If the name or source parameter is not provided, the project is not found, or the dataset creation fails, the method logs an error and returns None.
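
Because failures are reported by returning None rather than by raising, a script may want to stop early when a call fails. A minimal sketch of that pattern follows; `require` and `DatasetOperationError` are hypothetical names introduced here and are not provided by pyhectiqlab.

```python
from typing import Optional, TypeVar

T = TypeVar("T")


class DatasetOperationError(RuntimeError):
    """Hypothetical exception raised when a Lab call returned None."""


def require(result: Optional[T], action: str) -> T:
    """Return `result` unchanged, or raise if the call reported a failure."""
    if result is None:
        raise DatasetOperationError(f"{action} failed; check the logs for details")
    return result
```

For example, `require(Dataset.create(name="dataset_name", source="/path/to/the/dataset/"), "create")` would raise instead of silently continuing with None.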

Upload files to the dataset

To upload files to an existing dataset, use the upload method. This method uploads the files found at source to the dataset in the cloud.

```python
# Functional API
import pyhectiqlab.functional as hl
hl.upload_dataset(id="datasetID", source="/path/to/the/dataset/")
```

```python
# Object-oriented API
from pyhectiqlab import Dataset
Dataset.upload(id="datasetID", source="/path/to/the/dataset/")
```

```bash
# Command-line interface
hectiq-lab Dataset.upload --id "datasetID" --source "/path/to/the/dataset/"
```

| Name | Type | Default | Description |
| --- | --- | --- | --- |
| id | str | - | ID of the dataset. |
| source | str | - | Source of the dataset. |

Retrieve a dataset

To retrieve an existing dataset by name and version, use the Dataset.retrieve method.

```python
# Functional API
import pyhectiqlab.functional as hl
hl.retrieve_dataset(name="dataset_name", version="0.1.0")
```

```python
# Object-oriented API
from pyhectiqlab import Dataset
dataset = Dataset.retrieve(name="dataset_name", version="0.1.0")
```

```bash
# Command-line interface
hectiq-lab Dataset.retrieve --name "dataset_name" --version "0.1.0" --project hectiq-ai/demo
```

| Name | Type | Default | Description |
| --- | --- | --- | --- |
| name | str | - | Name of the dataset. |
| project | str | - | Project of the dataset. |
| version | str | - | Version of the dataset. |
| fields | list[str] | None | Fields to retrieve. |

WARNING

If the project is not found, the method logs an error and returns None.

Download a dataset locally

A dataset that has been uploaded to the Hectiq Lab can be downloaded using the download method.

```python
# Functional API
import pyhectiqlab.functional as hl
hl.download_dataset(name="dataset-name", version="1.0.0", path="/path/to/the/dataset")
```

```python
# Object-oriented API
from pyhectiqlab import Dataset
Dataset.download(name="dataset-name", version="1.0.0", path="/path/to/the/dataset")
```

```bash
# Command-line interface
hectiq-lab Dataset.download --name "dataset-name" --version "1.0.0" --path "/path/to/the/dataset"
```

| Name | Type | Default | Description |
| --- | --- | --- | --- |
| name | str | - | Name of the dataset. |
| version | str | - | Version of the dataset. |
| project | str, optional | None | Project of the dataset. |
| path | str | None | Path to download the dataset. If None, it uses the HECTIQLAB_DATASETS_DOWNLOAD environment variable or the current directory. |
| overwrite | bool | False | Whether to overwrite the existing files. |

WARNING

If the project or the dataset is not found, the method logs an error and returns None.
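
The fallback behaviour of the path parameter can be illustrated with a short sketch. `resolve_download_path` is a hypothetical helper, not the library's implementation, and the precedence shown (explicit path, then the HECTIQLAB_DATASETS_DOWNLOAD environment variable, then the current directory) is an assumption based on the order given in the parameter description.

```python
import os
from pathlib import Path
from typing import Optional

ENV_VAR = "HECTIQLAB_DATASETS_DOWNLOAD"


def resolve_download_path(path: Optional[str] = None) -> Path:
    """Pick the download directory using the assumed precedence."""
    # 1. An explicit path always wins.
    if path is not None:
        return Path(path)
    # 2. Otherwise use the environment variable when it is set and non-empty.
    env_path = os.environ.get(ENV_VAR)
    if env_path:
        return Path(env_path)
    # 3. Fall back to the current working directory.
    return Path.cwd()
```

Setting the environment variable once, for instance in a shell profile, would keep downloaded datasets in a single location across scripts.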

Delete a dataset

To delete a dataset from the repository, use the delete method. The dataset can be identified either by its id or by its name and version.

```python
# Functional API
import pyhectiqlab.functional as hl
hl.delete_dataset(name="dataset-name", version="1.0.0")
```

```python
# Object-oriented API
from pyhectiqlab import Dataset
Dataset.delete(name="dataset-name", version="1.0.0")
```

```bash
# Command-line interface
hectiq-lab Dataset.delete --name "dataset-name" --version "1.0.0"
```

| Name | Type | Default | Description |
| --- | --- | --- | --- |
| id | str | - | ID of the dataset. |
| name | str | None | Name of the dataset. |
| version | str | None | Version of the dataset. |
| project | str | None | Project of the dataset. |
| wait_response | bool | False | Wait for the response from the server. |
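
Since a dataset is identified either by its id or by its name and version, a caller may want to validate that choice before sending a delete request. The sketch below uses a hypothetical helper, `dataset_selector`, which is not part of pyhectiqlab; giving precedence to the id when both forms are supplied is an assumption made for this example.

```python
from typing import Dict, Optional


def dataset_selector(
    dataset_id: Optional[str] = None,
    name: Optional[str] = None,
    version: Optional[str] = None,
) -> Dict[str, str]:
    """Build keyword arguments that identify exactly one dataset."""
    # Assumption: when an id is given, it takes precedence over name/version.
    if dataset_id is not None:
        return {"id": dataset_id}
    if name is not None and version is not None:
        return {"name": name, "version": version}
    raise ValueError("provide either dataset_id, or both name and version")
```

The result can be unpacked into the call, for example `Dataset.delete(**dataset_selector(name="dataset-name", version="1.0.0"))`.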

Update a dataset

The update method updates the properties of a dataset given its ID.

DANGER

The name and version of a dataset can be updated. The update uses the id to find the dataset to update.

```python
# Functional API
import pyhectiqlab.functional as hl
hl.update(id="dataset-id", name="new_dataset_name")
```

```python
# Object-oriented API
from pyhectiqlab import Dataset
Dataset.update(id="dataset-id", name="new_dataset_name")
```

```bash
# Command-line interface
hectiq-lab Dataset.update --id "dataset-id" --name "new_dataset_name"
```

| Name | Type | Default | Description |
| --- | --- | --- | --- |
| id | str | - | ID of the dataset. |
| name | str, optional | - | Name of the dataset. |
| description | str, optional | - | Description of the dataset. |
| version | str, optional | - | Version of the dataset. |
| block | str, optional | - | Block of the dataset. |
| wait_response | bool | False | Wait for the response from the server. |

List datasets

To list the datasets, use the list method.

```python
# Functional API
import pyhectiqlab.functional as hl
hl.list(project="hectiq-ai/demo")
```

```python
# Object-oriented API
from pyhectiqlab import Dataset
Dataset.list(project="hectiq-ai/demo")
```

```bash
# Command-line interface
hectiq-lab Dataset.list --project "hectiq-ai/demo"
```

| Name | Type | Default | Description |
| --- | --- | --- | --- |
| project | str | - | Project of the dataset. |
| search | str | - | Search string. |
| author | str | - | Author of the dataset. |
| keep_latest_version | bool | False | If True, only returns the latest version of each dataset, grouped by dataset name. |
| fields | list[str] | - | Fields to retrieve. |
| page | int | - | Page number. |
| limit | int | - | Maximum number of datasets to return. |
| order_by | str | - | Field to order by. |
| order_direction | str | - | Order direction. |
| wait_response | bool | False | Wait for the response from the server. |
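
The page and limit parameters suggest that long listings are retrieved page by page. A generic way to walk through all pages is sketched below. `iter_pages` is a hypothetical helper that takes any function returning one page of results; the assumptions are that pages are numbered from 1, that each page is a list, and that a page shorter than limit marks the end. None of these are confirmed by this page.

```python
from typing import Callable, Iterator, List, Optional, TypeVar

T = TypeVar("T")


def iter_pages(
    fetch_page: Callable[[int, int], Optional[List[T]]],
    limit: int = 50,
    first_page: int = 1,  # assumption: page numbering starts at 1
) -> Iterator[T]:
    """Yield items from successive pages until a short or empty page is seen."""
    if limit <= 0:
        raise ValueError("limit must be positive")
    page = first_page
    while True:
        # Treat None (a failed call) the same as an empty page.
        items = fetch_page(page, limit) or []
        yield from items
        if len(items) < limit:
            break
        page += 1
```

With Hectiq Lab, fetch_page could wrap the list method, for example `lambda page, limit: Dataset.list(project="hectiq-ai/demo", page=page, limit=limit)`, provided that call returns a plain list of datasets.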

Attach / detach datasets to a run

A dataset can be attached to or detached from a run. To do so, use the attach and detach methods.

```python
# Functional API
import pyhectiqlab.functional as hl
hl.attach_dataset(name="dataset-name", version="1.0.0", run_id="1wekv90")
hl.detach_dataset(name="dataset-name", version="1.0.0", run_id="1wekv90")
```

```python
# Object-oriented API
from pyhectiqlab import Dataset
Dataset.attach(name="dataset-name", version="1.0.0", run_id="1wekv90")
Dataset.detach(name="dataset-name", version="1.0.0", run_id="1wekv90")
```

```bash
# Command-line interface
hectiq-lab Dataset.attach --name "dataset-name" --version "1.0.0" --run_id "1wekv90"
hectiq-lab Dataset.detach --name "dataset-name" --version "1.0.0" --run_id "1wekv90"
```

| Name | Type | Default | Description |
| --- | --- | --- | --- |
| name | str | - | Name of the dataset. |
| version | str | - | Version of the dataset. |
| run_id | str | None | ID of the run. |
| project | str | None | Project of the dataset. |
| wait_response | bool | False | Wait for the response from the server. |

WARNING

If the run_id parameter is not provided or the dataset is not found, it logs an error and returns None.

Attach / detach tags to the dataset

As with runs, tags can be attached to or detached from datasets using the add_tags and detach_tag methods.

```python
# Functional API
import pyhectiqlab.functional as hl
hl.add_tags_to_dataset(name="dataset-name", version="1.0.0", tags=["some", "tag"])
hl.detach_tag_from_dataset(name="dataset-name", version="1.0.0", tag="some")
```

```python
# Object-oriented API
from pyhectiqlab import Dataset
Dataset.add_tags(name="dataset-name", version="1.0.0", tags=["some", "tag"])
Dataset.detach_tag(name="dataset-name", version="1.0.0", tag="some")
```

```bash
# Command-line interface
hectiq-lab Dataset.add_tags --name "dataset-name" --version "1.0.0" --tags "some" --tags "tag"
hectiq-lab Dataset.detach_tag --name "dataset-name" --version "1.0.0" --tag "some"
```

WARNING

If the dataset is not found, it logs an error and returns None.