Datasets

Datasets are a designated place to store your machine learning datasets within a project. They can be shared within the project, used in apps, and tracked.

Upload

A dataset can be a directory or a single file. Suppose you have a set of images saved on your local machine as ./datasets/*.png. You can push the dataset from a run context with:

from pyhectiqlab import Run
run = Run(name="Push images dataset", project="lab/demo")

dataset_dir = "./datasets/"
run.add_dataset(dataset_dir, name="cat-images", push_dir=True)

Your dataset is now uploaded to the lab. The upload runs synchronously on the main thread, so the call blocks until it completes.
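Before pushing, it can help to check locally what the source path points to. The sketch below is not part of pyhectiqlab: `check_dataset_source` is a hypothetical helper that only mirrors the documented `push_dir` rule (a directory is accepted only with `push_dir=True`) and lists the files found under the path.

```python
from pathlib import Path
from typing import List


def check_dataset_source(source_path: str, push_dir: bool = False) -> List[Path]:
    """Hypothetical pre-upload check, not a pyhectiqlab function.

    Returns the files found at source_path and refuses a directory
    unless push_dir=True, mirroring the documented push_dir rule.
    """
    path = Path(source_path)
    if not path.exists():
        raise FileNotFoundError(f"No file or directory at {source_path}")
    if path.is_file():
        return [path]
    if not push_dir:
        raise ValueError("source_path is a directory; pass push_dir=True to confirm")
    # Recursively collect regular files, sorted for a stable order.
    return sorted(p for p in path.rglob("*") if p.is_file())
```

Calling `check_dataset_source("./datasets/", push_dir=True)` before `run.add_dataset(...)` gives a quick view of what a directory upload would include.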

Download

Prefilled download snippets are always available in the Usage tab of the web application.

You can download any dataset from a project you have access to, using either the Python client or the command line client.

Python client

With the Python client, you can download the dataset from within a run context:

# Assumes `run` is an existing pyhectiqlab Run object
run.download_dataset("cat-images")

When a dataset is downloaded from within a run context, the run is automatically attached to the dataset: the run page then shows a link to the dataset. Eventually, we plan to integrate the full dataset-model-run flow to help users understand their experiments. For this reason, we recommend using run.download_dataset so that the flow diagram can be built.

The second option is to download it outside a run:

from pyhectiqlab.datasets import download_dataset
download_dataset(dataset_name='cat-images', project_path='lab/demo')

Command line

The third option is to use the command line:

hectiqlab download-dataset -p lab/demo -n cat-images

All download methods accept additional parameters (dataset version, save path, overwrite). See the Methods section for details.

When you download a dataset, a directory named {name}-{version} is created. Alongside the dataset files, it contains a HLab_README.md file with meta information and a hidden file, .hlab_meta.json.
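The layout described above can be checked locally with a few lines of standard-library code. This is a minimal sketch, not a pyhectiqlab API: `describe_download` is a hypothetical helper that builds the expected {name}-{version} directory path and reports whether the two metadata files are present. It does not read the content of `.hlab_meta.json`, since its schema is not described here.

```python
from pathlib import Path
from typing import Dict


def describe_download(save_path: str, name: str, version: str) -> Dict[str, object]:
    """Hypothetical helper: locate a downloaded dataset and its metadata files."""
    # Downloads are documented to land in a "{name}-{version}" directory.
    dataset_dir = Path(save_path) / f"{name}-{version}"
    return {
        "path": dataset_dir,
        "exists": dataset_dir.is_dir(),
        "has_readme": (dataset_dir / "HLab_README.md").is_file(),
        "has_meta": (dataset_dir / ".hlab_meta.json").is_file(),
    }
```

For example, `describe_download("./", "cat-images", "1.0.0")` inspects `./cat-images-1.0.0`.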

Record usage

It is often useful to track which runs use an existing dataset. You can use run.add_dataset_usage and run.add_dataset_usage_from_dirpath to keep track of this usage. The former takes the name and version as arguments, while the latter infers them from the directory where the dataset was saved. Using these methods helps build the flow diagram of your experiments.

from pyhectiqlab import Run
run = Run(name="Linear regression prediction", project="lab/demo")

dataset_path = "./cat-images-1.0.0"
run.add_dataset_usage_from_dirpath(dataset_path)

# Equivalent
run.add_dataset_usage("cat-images", version="1.0.0")
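To illustrate why the two calls above are equivalent, here is a sketch of how a name and version could be recovered from a {name}-{version} directory name. `split_dataset_dirname` is a hypothetical helper written for this example; the client itself may rely on `.hlab_meta.json` rather than on the directory name, so treat this only as an illustration of the naming convention.

```python
import re
from pathlib import Path
from typing import Tuple

# "{name}-{major}.{minor}.{micro}"; the name itself may contain hyphens.
_DIR_PATTERN = re.compile(r"^(?P<name>.+)-(?P<version>\d+\.\d+\.\d+)$")


def split_dataset_dirname(dirpath: str) -> Tuple[str, str]:
    """Hypothetical helper: split 'cat-images-1.0.0' into name and version."""
    dirname = Path(dirpath).name  # drops leading './' and any trailing slash
    match = _DIR_PATTERN.match(dirname)
    if match is None:
        raise ValueError(f"{dirname!r} does not look like '{{name}}-{{version}}'")
    return match.group("name"), match.group("version")
```

With the path from the example, `split_dataset_dirname("./cat-images-1.0.0")` yields the same name and version that are passed explicitly to `run.add_dataset_usage`.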

Methods

pyhectiqlab.Run

Run.add_dataset(source_path: str, name: str, version: str = None, description: str = None, push_dir: bool = False, resume_upload: bool = False)
Upload a dataset to the lab.

Parameters

| Property | Type | Default | Description |
|---|---|---|---|
| source_path | str | - | Path to the dataset file or directory. |
| name | str | - | The dataset name. The lab will use a slugified version of the name. |
| version | str | None | Version in format '{major}.{minor}.{micro}' (e.g., 1.2.0). If None, version 1.0.0 is assigned to a new dataset; otherwise the minor number of the latest version with this name is incremented (e.g., 1.3.3 -> 1.4.0). |
| description | str | None | An optional short description of the dataset. For a longer description, push a README.md file in the root path. |
| push_dir | bool | False | If source_path is a directory, use push_dir=True to confirm that you understand that a directory will be uploaded. |
| resume_upload | bool | False | If True, you'll be able to push files to an existing dataset version. |

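The default versioning rule described for `version` can be written out as a small function. This is a sketch of the documented behaviour only, not the lab's actual implementation: `next_version` is a hypothetical name, and resetting the micro number to 0 is inferred from the 1.3.3 -> 1.4.0 example.

```python
from typing import Optional


def next_version(latest: Optional[str]) -> str:
    """Hypothetical sketch of the default version assignment.

    latest is the most recent version of a dataset with the same name,
    or None when no dataset with that name exists yet.
    """
    if latest is None:
        return "1.0.0"
    major, minor, _micro = (int(part) for part in latest.split("."))
    # Bump the minor number and reset micro, as in 1.3.3 -> 1.4.0.
    return f"{major}.{minor + 1}.0"
```

Passing an explicit `version` to `Run.add_dataset` bypasses this default.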
Run.download_dataset(dataset_name: str, version: str = None, save_path: str = "./", overwrite: bool = False)
Download an existing dataset from a run. Returns the path to the saved dataset.

Parameters

| Property | Type | Default | Description |
|---|---|---|---|
| dataset_name | str | - | Name of the dataset. |
| version | str | None | Specific version of the dataset. If None, the latest version is fetched. |
| save_path | str | ./ | Path to where the dataset will be saved. |
| overwrite | bool | False | Set to True if you want to download the dataset again even if it is already saved on your machine. |

Run.add_dataset_usage_from_dirpath(dirpath: str)
Attach a run to an existing dataset saved on your machine. The name and version of the dataset are inferred from the directory.

Parameters

| Property | Type | Default | Description |
|---|---|---|---|
| dirpath | str | - | Path to where the dataset is saved. |

Run.add_dataset_usage(name: str, version: str)
Attach the run to an existing dataset.

Parameters

| Property | Type | Default | Description |
|---|---|---|---|
| name | str | - | Dataset name |
| version | str | - | Dataset version |

pyhectiqlab

pyhectiqlab.datasets.download_dataset(dataset_name: str, project: str, version: str, save_path: str = "./", overwrite: bool = False)
Download an existing dataset without a run context.

Parameters

| Property | Type | Default | Description |
|---|---|---|---|
| dataset_name | str | - | Dataset name |
| project | str | - | Project name |
| version | str | - | Specific version of the dataset. If None, the latest version is fetched. |
| save_path | str | ./ | Save path. |
| overwrite | bool | False | Set to True if you want to download the dataset again even if it is already saved on your machine. |