Adding New Datasets to Qualia
Base Structure
Start by creating a new Python module in the dataset folder of the code base (Qualia-Core or a Qualia-Plugin source folder), called MyNewDataset.py in this example.
Inside this module, create a MyNewDataset class that inherits from RawDataset.
Adapt the __call__ method to load your data and return the appropriate objects described below.
Here’s a complete example showing the essential structure:
from __future__ import annotations

import logging
import sys
from pathlib import Path

import numpy as np

from qualia_core.datamodel import RawDataModel
from qualia_core.datamodel.RawDataModel import RawData, RawDataSets

# Base class import: assumes this module sits next to RawDataset.py in the dataset folder
from .RawDataset import RawDataset

if sys.version_info >= (3, 12):
    from typing import override
else:
    from typing_extensions import override

logger = logging.getLogger(__name__)


class MyNewDataset(RawDataset):
    def __init__(self, path: str) -> None:
        super().__init__()
        self.__path = Path(path)
        # Remove validation set if not needed
        self.sets.remove('valid')

    @override
    def __call__(self) -> RawDataModel:
        # Load your data files here
        # Example with numpy arrays:
        train_x = np.load(self.__path / 'train_x.npy')  # Shape: [N, S, C] for 1D or [N, H, W, C] for 2D
        train_y = np.load(self.__path / 'train_y.npy')  # Shape: [N] for class numbers
        test_x = np.load(self.__path / 'test_x.npy')
        test_y = np.load(self.__path / 'test_y.npy')

        # Create RawData objects for each set
        train = RawData(train_x, train_y)
        test = RawData(test_x, test_y)

        # Return the complete model
        return RawDataModel(
            sets=RawDataSets(train=train, test=test),
            name=self.name
        )
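To try the class above without real data, the four .npy files it loads can be generated synthetically. The following is a minimal sketch, assuming the file names used in the example (train_x.npy and so on); the helper name write_dummy_dataset, the array sizes, and the number of classes are made up for illustration and are not part of Qualia.

```python
from __future__ import annotations

import tempfile
from pathlib import Path

import numpy as np


def write_dummy_dataset(path: Path, n_train: int = 8, n_test: int = 4,
                        samples: int = 16, channels: int = 3,
                        num_classes: int = 2, seed: int = 0) -> None:
    """Write random 1D data ([N, S, C]) and integer labels ([N]) as .npy files.

    Hypothetical helper for illustration only; not part of Qualia.
    """
    rng = np.random.default_rng(seed)
    path.mkdir(parents=True, exist_ok=True)
    for name, n in (('train', n_train), ('test', n_test)):
        x = rng.standard_normal((n, samples, channels)).astype(np.float32)
        y = rng.integers(0, num_classes, size=n)
        np.save(path / f'{name}_x.npy', x)
        np.save(path / f'{name}_y.npy', y)


# Write the files to a temporary folder and read the training set back
with tempfile.TemporaryDirectory() as tmp:
    data_dir = Path(tmp) / 'mynewdataset'
    write_dummy_dataset(data_dir)
    train_x = np.load(data_dir / 'train_x.npy')
    train_y = np.load(data_dir / 'train_y.npy')
    written_files = sorted(p.name for p in data_dir.iterdir())

print(train_x.shape, train_y.shape)  # (8, 16, 3) (8,)
```

Pointing the path parameter of MyNewDataset at a folder written this way should be enough to exercise the loading code before the real data is available.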
Core Data Structures
RawData
Represents a single dataset partition.
Contains:
  x: Input data as numpy.ndarray
  y: Ground truth labels as numpy.ndarray
Provides methods for importing/exporting in compressed format (useful for saving/loading preprocessed datasets).
RawDataSets
Groups dataset partitions together.
Contains:
  train: Training set (RawData)
  test: Test set (RawData)
  valid: Validation set (RawData, optional)
RawDataModel
Top-level container returned by the dataset’s __call__ method.
Contains:
  sets: RawDataSets object
  name: Dataset name
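The nesting of these three containers can be pictured with a few stand-in dataclasses. This sketch only mirrors the field names listed above (x, y, train, test, valid, sets, name); the Sketch* classes are hypothetical and are not Qualia's actual definitions, which also provide methods such as the compressed import/export.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional

import numpy as np


@dataclass
class SketchRawData:
    """Stand-in for one partition: inputs and ground truth."""
    x: np.ndarray
    y: np.ndarray


@dataclass
class SketchRawDataSets:
    """Stand-in for the group of partitions; valid is optional."""
    train: SketchRawData
    test: SketchRawData
    valid: Optional[SketchRawData] = None


@dataclass
class SketchRawDataModel:
    """Stand-in for the top-level container."""
    sets: SketchRawDataSets
    name: str


train = SketchRawData(x=np.zeros((8, 16, 3), dtype=np.float32), y=np.zeros(8, dtype=np.int64))
test = SketchRawData(x=np.zeros((4, 16, 3), dtype=np.float32), y=np.zeros(4, dtype=np.int64))
model = SketchRawDataModel(sets=SketchRawDataSets(train=train, test=test), name='MyNewDataset')

# Data is reached by walking down the hierarchy: model -> sets -> partition -> array
print(model.name, model.sets.train.x.shape, model.sets.valid)
```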
Expected Data Dimensions
1D Data (e.g., time series)
Input shape: [N, S, C]
  N: Number of input examples
  S: Time samples per example
  C: Channels
2D Data (e.g., images)
Input shape: [N, H, W, C]
  N: Number of input examples
  H: Height
  W: Width
  C: Channels
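A quick sanity check on array shapes before wrapping them in RawData can catch layout mistakes early. The helper below is a hypothetical sketch (input_layout is not a Qualia function) that classifies a shape tuple, such as train_x.shape, according to the two layouts above.

```python
from __future__ import annotations


def input_layout(shape: tuple[int, ...]) -> str:
    """Return '1D' for [N, S, C] or '2D' for [N, H, W, C]; raise otherwise.

    Hypothetical helper for illustration; not part of Qualia.
    """
    if len(shape) == 3:
        return '1D'
    if len(shape) == 4:
        return '2D'
    raise ValueError(f'Expected [N, S, C] or [N, H, W, C], got shape {shape}')


print(input_layout((100, 128, 9)))     # 1D
print(input_layout((100, 32, 32, 3)))  # 2D
```

Under the layouts above, a single-channel series stored as [N, S] would be rejected by this check; it needs an explicit trailing channel axis to become [N, S, 1].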
Ground Truth (Labels)
Classification:
  Option 1: Class numbers as integers, shape [N] (use preprocessing.Class2BinMatrix later for one-hot encoding)
  Option 2: One-hot encoded matrix, shape [N, num_classes]
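The two label layouts carry the same information. As an illustration of how they relate, the following sketch converts class numbers to a one-hot matrix with plain NumPy; it is not the implementation of preprocessing.Class2BinMatrix, and the function name to_one_hot is hypothetical.

```python
import numpy as np


def to_one_hot(labels: np.ndarray, num_classes: int) -> np.ndarray:
    """Convert integer class numbers of shape [N] to a one-hot matrix [N, num_classes]."""
    one_hot = np.zeros((labels.shape[0], num_classes), dtype=np.float32)
    # Set a single 1.0 per row, in the column given by that row's class number
    one_hot[np.arange(labels.shape[0]), labels] = 1.0
    return one_hot


labels = np.array([0, 2, 1, 2])              # Option 1: shape [N]
matrix = to_one_hot(labels, num_classes=3)   # Option 2: shape [N, num_classes]

print(matrix.shape)           # (4, 3)
print(matrix.argmax(axis=1))  # [0 2 1 2]
```

Taking the argmax over the last axis recovers the original class numbers, which is a convenient way to check that a one-hot matrix was built correctly.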
Configuration and Parameters
Parameters can be declared in the constructor and set via the configuration file, e.g.:
def __init__(self, path: str = '', dtype: str = 'float32') -> None:
    super().__init__()
    self.__path = Path(path)
    self.__dtype = dtype
Configuration file (conf/mynewdataset/config.toml):
[dataset]
kind = "MyNewDataset"
params.path = "data/mynewdataset"
params.dtype = "float32"
Final Steps
After creating your dataset class, import it in dataset/__init__.py:
from .MyNewDataset import MyNewDataset
__all__ = [..., 'MyNewDataset']