Adding New Datasets to Qualia¶

Base Structure¶

Start by creating a new Python module in the dataset folder of the code base (Qualia-Core or a Qualia-Plugin source folder), called MyNewDataset.py in this example. Inside this module, create a MyNewDataset class that inherits from RawDataset.

Adapt the __call__ method to load your data and return the appropriate objects described below.

Here’s a complete example showing the essential structure:

from __future__ import annotations
import sys
import logging
import numpy as np
from pathlib import Path
from qualia_core.datamodel import RawDataModel
from qualia_core.datamodel.RawDataModel import RawData, RawDataSets

if sys.version_info >= (3, 12):
    from typing import override
else:
    from typing_extensions import override

logger = logging.getLogger(__name__)

class MyNewDataset(RawDataset):
    def __init__(self, path: str) -> None:
        super().__init__()
        self.__path = Path(path)
        # Remove validation set if not needed
        self.sets.remove('valid')

    @override
    def __call__(self) -> RawDataModel:
        # Load your data files here
        # Example with numpy arrays:
        train_x = np.load(self.__path / 'train_x.npy')  # Shape: [N, S, C] for 1D or [N, H, W, C] for 2D
        train_y = np.load(self.__path / 'train_y.npy')  # Shape: [N] for class numbers
        
        test_x = np.load(self.__path / 'test_x.npy')
        test_y = np.load(self.__path / 'test_y.npy')

        # Create RawData objects for each set
        train = RawData(train_x, train_y)
        test = RawData(test_x, test_y)

        # Return the complete model
        return RawDataModel(
            sets=RawDataSets(train=train, test=test),
            name=self.name
        )

Core Data Structures¶

RawData¶

Represents a single dataset partition
Contains:
- x: Input data as numpy.ndarray
- y: Ground truth labels as numpy.ndarray
Provides methods for importing/exporting in compressed format (useful for saving/loading preprocessed datasets)

RawDataSets¶

Groups dataset partitions together
Contains:
- train: Training set (RawData)
- test: Test set (RawData)
- valid: Validation set (RawData, optional)

RawDataModel¶

Top-level container returned by dataset’s __call__ method
Contains:
- sets: RawDataSets object
- name: Dataset name

Expected Data Dimensions¶

1D Data (e.g., time series)¶

Input shape: [N, S, C]
- N: Number of input data
- S: Time samples
- C: Channels

2D Data (e.g., images)¶

Input shape: [N, H, W, C]
- N: Number of input data
- H: Height
- W: Width
- C: Channels

Ground Truth (Labels)¶

Classification:
- Option 1: Class numbers as integers [N] (use preprocessing.Class2BinMatrix later for one-hot encoding)
- Option 2: One-hot encoded matrix [N, num_classes]

Configuration and Parameters¶

Parameters can be declared in the constructor and set via configuration file, e.g.:

def __init__(self, path: str = '', dtype: str = 'float32') -> None:
    super().__init__()
    self.__path = Path(path)
    self.__dtype = dtype

Configuration file (conf/mynewdataset/config.toml):

[dataset]
kind = "MyNewDataset"
params.path = "data/mynewdataset"
params.dtype = "float32"

Final Steps¶

After creating your dataset class, import it in dataset/__init__.py:

from .MyNewDataset import MyNewDataset

__all__ = [..., 'MyNewDataset']