
Datasets

Datasets are a designated place to store your machine learning datasets within a project. They can be easily shared within a project, used in apps, and tracked.

Upload

Datasets can take the form of a directory or a single file. Suppose you have a set of images saved on your local machine as ./datasets/*.png. You can push the dataset from a run context with:

from pyhectiqlab import Run
run = Run(name="Push images dataset", project="lab/demo")

dataset_dir = "./datasets/"
run.add_dataset(dataset_dir, name="cat-images", push_dir=True)

Your dataset is now uploaded to the lab. The upload runs synchronously on the main thread, so the call blocks until it completes.
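
Because push_dir=True uploads a whole directory, it can help to check what the directory contains before pushing. The sketch below is not part of pyhectiqlab: summarize_dataset_dir is a hypothetical helper that only uses the Python standard library to count the files under a directory and total their size.

```python
from collections import Counter
from pathlib import Path


def summarize_dataset_dir(dataset_dir):
    """Summarize the files found recursively under dataset_dir.

    Hypothetical helper (not a pyhectiqlab function): it only reads the
    local filesystem and never contacts the lab.
    """
    # Keep regular files only; directories themselves are not counted.
    files = [p for p in Path(dataset_dir).rglob("*") if p.is_file()]
    return {
        "n_files": len(files),
        "total_bytes": sum(p.stat().st_size for p in files),
        # Count files per lowercase extension, e.g. {".png": 120}.
        "by_suffix": dict(Counter(p.suffix.lower() for p in files)),
    }
```

Calling summarize_dataset_dir("./datasets/") before run.add_dataset gives a quick check that only the intended files will be pushed.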

Download

Prefilled download snippets are always available in the web application, under the Usage tab.

You can download any dataset from a project you have access to, using either the Python client or the command line client.

Python client

If you download the dataset from the Python client, you can use a run context:

run.download_dataset("cat-images")

When a dataset is downloaded from within a run context, the run is automatically attached to the dataset: a link to the dataset appears on the run page. Eventually, we plan to integrate the full dataset-model-run flow to help users understand their experiments. For this reason, we recommend using run.download_dataset to prepare the flow diagram.

The second option is to download it outside a run:

from pyhectiqlab.datasets import download_dataset
download_dataset(dataset_name='cat-images', project_path='lab/demo')

Command line

The third option is to use the command line:

hectiqlab download-dataset -p lab/demo -n cat-images
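
If the download has to be triggered from a Python script without importing the client, the same command can be launched with the standard subprocess module. This is a minimal sketch: build_download_command and download_with_cli are hypothetical helper names, only the -p and -n flags shown above are used, and the hectiqlab executable is assumed to be installed and on the PATH.

```python
import subprocess


def build_download_command(project, name):
    """Return the argument list for the CLI call shown above.

    Hypothetical helper: it only assembles the -p and -n flags that
    appear in this page.
    """
    return ["hectiqlab", "download-dataset", "-p", project, "-n", name]


def download_with_cli(project, name):
    """Run the CLI; assumes `hectiqlab` is installed and authenticated."""
    # check=True raises subprocess.CalledProcessError on a non-zero exit code.
    return subprocess.run(build_download_command(project, name), check=True)
```

Passing the arguments as a list avoids shell quoting issues with project paths or dataset names.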

All download methods accept additional parameters (dataset version, save path, overwrite). See the Methods section for details.

When you download a dataset, a directory named {name}-{version} is created. Alongside the dataset files, two files are added: HLab_README.md, which contains meta information, and a hidden file, .hlab_meta.json.
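
As an illustration of this layout, the sketch below builds the expected directory path and checks for the two generated files. It assumes that both files sit at the root of the dataset directory and that .hlab_meta.json holds valid JSON; its actual fields are not described here, so the content is returned as-is. expected_dataset_dir and inspect_download are hypothetical helpers, not pyhectiqlab functions.

```python
import json
from pathlib import Path


def expected_dataset_dir(save_path, name, version):
    """Build the {name}-{version} directory path under save_path."""
    return Path(save_path) / f"{name}-{version}"


def inspect_download(dataset_dir):
    """Report whether the generated files exist in a downloaded dataset.

    Assumption: HLab_README.md and .hlab_meta.json are located at the
    root of the dataset directory.
    """
    root = Path(dataset_dir)
    meta_path = root / ".hlab_meta.json"
    meta = None
    if meta_path.is_file():
        # The schema of this file is not documented on this page,
        # so the parsed JSON is returned without interpretation.
        meta = json.loads(meta_path.read_text(encoding="utf-8"))
    return {
        "has_readme": (root / "HLab_README.md").is_file(),
        "meta": meta,
    }
```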

Record usage

It is often useful to track which runs use an existing dataset. You can use run.add_dataset_usage and run.add_dataset_usage_from_dirpath to record this usage. The former takes the name and version as arguments, while the latter infers the name and version from the directory where the dataset has been saved. Using these methods helps build the flow diagram of your experiments.

from pyhectiqlab import Run
run = Run(name="Linear regression prediction", project="lab/demo")

dataset_path = "./cat-images-1.0.0"
run.add_dataset_usage_from_dirpath(dataset_path)

# Equivalent
run.add_dataset_usage("cat-images", version="1.0.0")
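
To make the equivalence concrete, here is one way the name and version could be recovered from a {name}-{version} directory name. This is only a sketch of the idea: parse_dataset_dirname is a hypothetical helper, and the library itself may rely on the .hlab_meta.json file rather than on the directory name.

```python
import re
from pathlib import Path

# A dataset name followed by "-" and a {major}.{minor}.{micro} version.
_DIR_PATTERN = re.compile(r"^(?P<name>.+)-(?P<version>\d+\.\d+\.\d+)$")


def parse_dataset_dirname(dirpath):
    """Split a directory such as ./cat-images-1.0.0 into (name, version).

    Hypothetical helper illustrating the naming convention; it is not
    the pyhectiqlab implementation.
    """
    # Path.name ignores a trailing slash and any parent directories.
    dirname = Path(dirpath).name
    match = _DIR_PATTERN.match(dirname)
    if match is None:
        raise ValueError(f"{dirname!r} does not look like '{{name}}-{{version}}'")
    return match.group("name"), match.group("version")
```

With the path used above, parse_dataset_dirname("./cat-images-1.0.0") yields the name "cat-images" and the version "1.0.0", which are the two arguments passed to run.add_dataset_usage.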

Methods

pyhectiqlab.Run

Run.add_dataset(source_path: str, name: str, version: str = None, description: str = None, push_dir: bool = False, resume_upload: bool = False)

Upload a dataset to the lab.

Parameters

Property	Type	Default	Description
`source_path`	`str`	`-`	Path to the dataset file or directory.
`name`	`str`	`-`	The dataset name. The lab will use a slugified version of the name.
`version`	`str`	`None`	Version in the format '{major}.{minor}.{micro}' (e.g., 1.2.0). If `None`, version 1.0.0 is assigned to a new dataset; otherwise, the minor number of the latest version of the dataset with this name is incremented (e.g., 1.3.3 -> 1.4.0).
`description`	`str`	`None`	An optional short description of the dataset. For a longer description, push a README.md file in the root path.
`push_dir`	`bool`	`False`	If the `source_path` is a directory, use `push_dir=True` to confirm that you understand that a directory will be uploaded.
`resume_upload`	`bool`	`False`	If `True`, you'll be able to push files to an existing dataset version.
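
The default versioning rule for `version=None` can be sketched in a few lines. This is an illustration of the rule as described in the table, not the lab's server-side code, and next_version is a hypothetical name. Resetting micro to 0 follows the 1.3.3 -> 1.4.0 example.

```python
from typing import Optional


def next_version(latest: Optional[str] = None) -> str:
    """Return the version assigned when version=None.

    latest is the most recent version of a dataset with the same name,
    or None if no dataset with that name exists yet.
    """
    if latest is None:
        # First upload under this name.
        return "1.0.0"
    # Raises ValueError if latest is not '{major}.{minor}.{micro}'.
    major, minor, _micro = (int(part) for part in latest.split("."))
    # Increment minor and reset micro, as in 1.3.3 -> 1.4.0.
    return f"{major}.{minor + 1}.0"
```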

Run.download_dataset(dataset_name: str, version: str = None, save_path: str = "./", overwrite: bool = False)

Download an existing dataset from a run. Returns the path to the saved dataset.

Parameters

Property	Type	Default	Description
`dataset_name`	`str`	`-`	Name of the dataset.
`version`	`str`	`None`	Specific version of the dataset. If `None`, the latest version is fetched.
`save_path`	`str`	`./`	Path to where the dataset will be saved.
`overwrite`	`bool`	`False`	Set to `True` if you want to download the dataset again even if it is already saved on your machine.

Run.add_dataset_usage_from_dirpath(dirpath: str)

Attach a run to an existing dataset saved on your machine. The name and version of the dataset are inferred from the directory.

Parameters

Property	Type	Default	Description
`dirpath`	`str`	`-`	Path to where the dataset is saved.

Run.add_dataset_usage(name: str, version: str)

Attach the run to an existing dataset.

Parameters

Property	Type	Default	Description
`name`	`str`	`-`	Dataset name
`version`	`str`	`-`	Dataset version

pyhectiqlab

pyhectiqlab.datasets.download_dataset(dataset_name: str, project: str, version: str = None, save_path: str = "./", overwrite: bool = False)

Download an existing dataset without a run context.

Parameters

Property	Type	Default	Description
`dataset_name`	`str`	`-`	Dataset name
`project`	`str`	`-`	Project name
`version`	`str`	`None`	Specific version of the dataset. If `None`, the latest version is fetched.
`save_path`	`str`	`./`	Path to where the dataset will be saved.
`overwrite`	`bool`	`False`	Set to `True` if you want to download the dataset again even if it is already saved on your machine.