Adding New Datasets to Qualia

Base Structure

Create a new Python module in the dataset folder of the code base (either Qualia-Core or a Qualia-Plugin source folder). This example calls it MyNewDataset.py. Inside this module, define a MyNewDataset class that inherits from RawDataset.

Implement the __call__ method so that it loads your data and returns a RawDataModel built from the objects described below.

Here’s a complete example showing the essential structure:

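from __future__ import annotations

import logging
import sys
from pathlib import Path

import numpy as np
from qualia_core.datamodel import RawDataModel
from qualia_core.datamodel.RawDataModel import RawData, RawDataSets

# Base class for the new dataset. The module path is assumed from the Qualia-Core layout;
# inside the dataset folder itself, the relative form is: from .RawDataset import RawDataset
from qualia_core.dataset.RawDataset import RawDataset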

if sys.version_info >= (3, 12):
    from typing import override
else:
    from typing_extensions import override

logger = logging.getLogger(__name__)

class MyNewDataset(RawDataset):
    def __init__(self, path: str) -> None:
        super().__init__()
        self.__path = Path(path)
        # Remove validation set if not needed
        self.sets.remove('valid')

    @override
    def __call__(self) -> RawDataModel:
        # Load your data files here
        # Example with numpy arrays:
        train_x = np.load(self.__path / 'train_x.npy')  # Shape: [N, S, C] for 1D or [N, H, W, C] for 2D
        train_y = np.load(self.__path / 'train_y.npy')  # Shape: [N] for class numbers
        
        test_x = np.load(self.__path / 'test_x.npy')
        test_y = np.load(self.__path / 'test_y.npy')

        # Create RawData objects for each set
        train = RawData(train_x, train_y)
        test = RawData(test_x, test_y)

        # Return the complete model
        return RawDataModel(
            sets=RawDataSets(train=train, test=test),
            name=self.name
        )
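The loader above expects four .npy files in the dataset folder. To try it without real data, placeholder files can be generated first. The following is a minimal sketch using only NumPy; the helper name write_dummy_dataset, the array sizes, and the number of classes are illustrative assumptions, not part of Qualia.

```python
import tempfile
from pathlib import Path

import numpy as np


def write_dummy_dataset(root: Path, n_train: int = 8, n_test: int = 2,
                        samples: int = 128, channels: int = 3,
                        num_classes: int = 4) -> None:
    """Write random 1D data as train/test .npy files (hypothetical helper)."""
    rng = np.random.default_rng(0)  # fixed seed so the files are reproducible
    for split, n in (('train', n_train), ('test', n_test)):
        # Inputs follow the [N, S, C] layout, labels are integer class numbers of shape [N]
        x = rng.standard_normal((n, samples, channels)).astype(np.float32)
        y = rng.integers(0, num_classes, size=n)
        np.save(root / f'{split}_x.npy', x)
        np.save(root / f'{split}_y.npy', y)


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    write_dummy_dataset(root)
    # Read the files back the same way the __call__ method does
    train_x = np.load(root / 'train_x.npy')
    train_y = np.load(root / 'train_y.npy')
    test_x = np.load(root / 'test_x.npy')
    test_y = np.load(root / 'test_y.npy')
```

Pointing the path parameter of MyNewDataset at such a folder should then be enough to exercise the loading code before the real data is available.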

Core Data Structures

RawData

  • Represents a single dataset partition

  • Contains:

    • x: Input data as numpy.ndarray

    • y: Ground truth labels as numpy.ndarray

  • Provides methods for importing/exporting in compressed format (useful for saving/loading preprocessed datasets)

RawDataSets

  • Groups dataset partitions together

  • Contains:

    • train: Training set (RawData)

    • test: Test set (RawData)

    • valid: Validation set (RawData, optional)

RawDataModel

  • Top-level container returned by the dataset's __call__ method

  • Contains:

    • sets: RawDataSets object

    • name: Dataset name
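The nesting of these three containers can be pictured with plain dataclasses. This is only a sketch of the hierarchy described above: the Sketch* classes are hypothetical stand-ins that carry the listed fields, not the qualia_core classes, and they omit the import/export methods.

```python
from dataclasses import dataclass
from typing import Optional

import numpy as np


@dataclass
class SketchRawData:
    """One partition: inputs and their ground truth (stand-in, not qualia_core.RawData)."""
    x: np.ndarray
    y: np.ndarray


@dataclass
class SketchRawDataSets:
    """Groups the partitions; the validation set is optional."""
    train: SketchRawData
    test: SketchRawData
    valid: Optional[SketchRawData] = None


@dataclass
class SketchRawDataModel:
    """Top-level container: the partitions plus the dataset name."""
    sets: SketchRawDataSets
    name: str


train = SketchRawData(x=np.zeros((8, 128, 3), dtype=np.float32),
                      y=np.zeros(8, dtype=np.int64))
test = SketchRawData(x=np.zeros((2, 128, 3), dtype=np.float32),
                     y=np.zeros(2, dtype=np.int64))
model = SketchRawDataModel(sets=SketchRawDataSets(train=train, test=test),
                           name='MyNewDataset')
```

Access then follows the nesting, for example model.sets.train.x for the training inputs.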

Expected Data Dimensions

1D Data (e.g., time series)

  • Input shape: [N, S, C]

    • N: Number of examples in the set

    • S: Number of time samples per example

    • C: Channels

2D Data (e.g., images)

  • Input shape: [N, H, W, C]

    • N: Number of examples in the set

    • H: Height

    • W: Width

    • C: Channels
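Both layouts put the channel axis last, so single-channel data stored without that axis needs one appended before it matches the shapes above. The sketch below shows this with NumPy; describe_input is a hypothetical helper written for this illustration, not a Qualia function.

```python
import numpy as np


def describe_input(x: np.ndarray) -> str:
    """Name the axes of an input array according to the channels-last layouts above."""
    if x.ndim == 3:
        n, s, c = x.shape
        return f'1D: N={n}, S={s}, C={c}'
    if x.ndim == 4:
        n, h, w, c = x.shape
        return f'2D: N={n}, H={h}, W={w}, C={c}'
    raise ValueError(f'expected 3 or 4 dimensions, got shape {x.shape}')


# Grayscale images stored as [N, H, W] lack the channel axis; append one of size 1.
# Without it, a 3-dimensional image array would be read as 1D data [N, S, C].
gray = np.zeros((10, 28, 28), dtype=np.float32)
images = gray[..., np.newaxis]  # shape (10, 28, 28, 1)

# Single-channel time series stored as [N, S] need the same treatment.
series = np.zeros((10, 128), dtype=np.float32)[..., np.newaxis]  # shape (10, 128, 1)
```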

Ground Truth (Labels)

  • Classification:

    • Option 1: Integer class numbers of shape [N] (apply preprocessing.Class2BinMatrix later to convert them to one-hot encoding)

    • Option 2: One-hot encoded matrix of shape [N, num_classes]
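To make the relationship between the two label formats concrete, here is a plain NumPy sketch that converts class numbers into a one-hot matrix. It only illustrates the target format; it is not the implementation of preprocessing.Class2BinMatrix, and the function name to_one_hot is an assumption made for this example.

```python
import numpy as np


def to_one_hot(labels: np.ndarray, num_classes: int) -> np.ndarray:
    """Convert integer class numbers of shape [N] to a one-hot matrix [N, num_classes]."""
    # Row i of the identity matrix is the one-hot vector for class i,
    # so indexing it with the labels selects one row per example.
    return np.eye(num_classes, dtype=np.float32)[labels]


labels = np.array([0, 2, 1, 2])        # Option 1: class numbers, shape [4]
one_hot = to_one_hot(labels, 3)        # Option 2: one-hot matrix, shape [4, 3]
recovered = one_hot.argmax(axis=1)     # back to class numbers
```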

Configuration and Parameters

Declare dataset parameters as constructor arguments; their values can then be set from the configuration file. For example:

def __init__(self, path: str = '', dtype: str = 'float32') -> None:
    super().__init__()
    self.__path = Path(path)
    self.__dtype = dtype

Configuration file (conf/mynewdataset/config.toml):

[dataset]
kind = "MyNewDataset"
params.path = "data/mynewdataset"
params.dtype = "float32"

Final Steps

After creating your dataset class, import it in dataset/__init__.py and add its name to __all__:

from .MyNewDataset import MyNewDataset

__all__ = [..., 'MyNewDataset']