hezar.data.datasets.dataset module

class hezar.data.datasets.dataset.Dataset(config: DatasetConfig, split: str = 'train', preprocessor: str | Preprocessor | PreprocessorsContainer | None = None, **kwargs)

Bases: Dataset

Base class for all datasets in Hezar.

Parameters:
  • config – The configuration object for the dataset.

  • split – Dataset split name, e.g., train, test, or validation.

  • preprocessor – Preprocessor object or path (note that Hezar dataset classes require this argument).

  • **kwargs – Additional keyword arguments.

required_backends

List of required backends for the dataset.

Type:

List[str | Backends]

config_filename

Default dataset config file name.

Type:

str

cache_dir

Default cache directory for the dataset.

Type:

str

cache_dir = '/home/runner/.cache/hezar/datasets'
config_filename = 'dataset_config.yaml'
static create_preprocessor(preprocessor: str | Preprocessor | PreprocessorsContainer)

Create the preprocessor for the dataset.

Parameters:

preprocessor (str | Preprocessor | PreprocessorsContainer) – Preprocessor for the dataset

classmethod load(hub_path: str | PathLike, split: str | SplitType | None = None, preprocessor: str | Preprocessor | PreprocessorsContainer | None = None, config: DatasetConfig | None = None, config_filename: str | None = None, cache_dir: str | None = None, **kwargs) → Dataset

Load the dataset from a hub path.

Parameters:
  • hub_path (str | os.PathLike) – Path to dataset from hub or locally.

  • split (Optional[str | SplitType]) – Dataset split, defaults to “train”.

  • preprocessor (str | Preprocessor | PreprocessorsContainer) – Preprocessor object for the dataset

  • config (DatasetConfig) – A config object to use instead of the config in the repo, or when the repo has no dataset_config.yaml file.

  • config_filename (Optional[str]) – Dataset config file name. Falls back to dataset_config.yaml if not given.

  • cache_dir (str) – Path to cache directory, defaults to Hezar’s cache directory

  • **kwargs – Config parameters as keyword arguments.

Returns:

An instance of the loaded dataset.

Return type:

Dataset
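
The call pattern described above can be sketched as a small wrapper. This is a minimal sketch, not part of Hezar: the helper name load_split and its config_overrides parameter are hypothetical, and the import path is taken from the module name at the top of this page. It assumes Hezar and its datasets backend are installed when the function is actually called.

```python
from typing import Any, Optional


def load_split(
    hub_path: str,
    split: str = "train",
    preprocessor: Optional[Any] = None,
    **config_overrides: Any,
):
    """Load one split of a Hezar dataset through Dataset.load.

    load_split is a hypothetical convenience wrapper, not a Hezar API.
    """
    # Imported lazily so the helper can be defined even when Hezar is not
    # installed; the import only runs when the function is called.
    from hezar.data.datasets.dataset import Dataset

    # hub_path may be a Hub repo ID or a local path. preprocessor may be a
    # path string, a Preprocessor, or a PreprocessorsContainer. Any extra
    # keyword arguments are forwarded as config parameters.
    return Dataset.load(
        hub_path,
        split=split,
        preprocessor=preprocessor,
        **config_overrides,
    )
```

A call such as load_split("<dataset-repo>", split="test", preprocessor="<preprocessor-repo>") would then return a Dataset instance, where both repo names are placeholders for real Hub or local paths.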

required_backends: List[str | Backends] = [Backends.DATASETS]