Understanding How to Add Datasets to Qualia: A Comprehensive Guide

When adding a new dataset to Qualia, we’re creating a bridge between raw data files and Qualia’s machine learning pipeline. Think of it like building a translator that takes your data and makes it speak Qualia’s language. Let’s understand this process deeply, using MNIST as our learning example.

Summary

Start by creating a new Python module in the dataset folder of the code base (Qualia-Core or a Qualia-Plugin source folder), called MyNewDataset.py in this example. Inside this module, create a MyNewDataset class that inherits from RawDataset. Adapt the __call__ method to load your data and return the appropriate objects described below.
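
To get a feel for the overall shape before we dig into the details, here is a minimal sketch of such a module. The class name MyNewDataset and the placeholder arrays are purely illustrative; the real loading logic is what the rest of this guide builds up:

from __future__ import annotations
import numpy as np
from qualia_core.datamodel.RawDataModel import RawData, RawDataSets, RawDataModel
from qualia_core.dataset.RawDataset import RawDataset

class MyNewDataset(RawDataset):
    """Skeleton dataset: replace the placeholder arrays with your own file loading."""

    def __init__(self, path: str = '') -> None:
        super().__init__()
        self.__path = path  # Where the raw files live (unused in this placeholder)
        self.sets.remove('valid')  # Skip the validation set if you don't have one

    def __call__(self) -> RawDataModel:
        # Placeholder data so the skeleton runs; replace with real file loading
        train_x = np.zeros((100, 28, 28, 1), dtype=np.float32)
        train_y = np.zeros(100, dtype=np.uint8)
        test_x = np.zeros((10, 28, 28, 1), dtype=np.float32)
        test_y = np.zeros(10, dtype=np.uint8)
        return RawDataModel(
            sets=RawDataSets(train=RawData(train_x, train_y),
                             test=RawData(test_x, test_y)),
            name=self.name)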

First, Let’s Understand What We’re Building

Before writing any code, we need to understand what a dataset class does in Qualia. Think of it as a factory that:

  1. Takes raw data files as input

  2. Processes them into a standard format

  3. Delivers them in a way that Qualia can understand and use

Let’s look at each component and understand why it’s needed.

The Building Blocks: Understanding Each Method

1. The Dataset Class Structure

from __future__ import annotations
import logging
from pathlib import Path
import numpy as np
from qualia_core.datamodel.RawDataModel import RawData, RawDataSets, RawDataModel
from qualia_core.dataset.RawDataset import RawDataset

logger = logging.getLogger(__name__)

class MNIST(RawDataset):
    """MNIST handwritten digits dataset."""

Let’s understand each import and why we need it:

  • annotations: Postpones evaluation of type hints, so forward references and newer hint syntax such as tuple[np.ndarray, np.ndarray] work even on older Python versions

  • logging: For keeping track of what our dataset is doing

  • Path: Makes file handling consistent across operating systems

  • numpy: For efficient array operations on our data

  • RawDataset: The base class that tells Qualia how to interact with our dataset

  • RawData, RawDataSets, RawDataModel: The containers that Qualia expects
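
To see how these three containers fit together (using only the constructors and attributes that appear in the code later in this guide), here is a tiny illustration:

import numpy as np
from qualia_core.datamodel.RawDataModel import RawData, RawDataSets, RawDataModel

# Tiny placeholder arrays, just to show how the containers nest
x = np.zeros((4, 28, 28, 1), dtype=np.float32)
y = np.zeros(4, dtype=np.uint8)

model = RawDataModel(
    sets=RawDataSets(train=RawData(x, y), test=RawData(x, y)),
    name='MNIST')
print(model.sets.train.x.shape)  # (4, 28, 28, 1) -- how downstream code accesses the data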

2. The Initialization Method: Setting Up Our Dataset

The __init__ method initializes the dataset. Think of it like setting up your workspace before starting work. We need to:

  1. Know where to find our data files

  2. Decide which version of the data we want

  3. Set up our working environment

def __init__(self, path: str = '', variant: str = 'raw') -> None:
    """Store the data location and variant, and drop the unused validation set."""
    super().__init__()  # Set up the basic RawDataset structure
    self.__path = Path(path)  # Convert string path to a proper Path object
    self.__variant = variant  # Store which variant we want to use
    self.sets.remove('valid')  # Tell Qualia we won't use a validation set

This method is like preparing your kitchen before cooking:

  • path: Where to find your ingredients (data files)

  • variant: Which recipe you’re following (data variant)

  • sets.remove('valid'): Removing tools you won’t need (validation set)
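
In practice, constructing the dataset then looks like this (the path is just an example location):

dataset = MNIST(path='data/mnist', variant='raw')  # Point it at the folder holding the IDX files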

3. The File Reader: Getting Raw Data

def _read_idx_file(self, filepath: Path) -> np.ndarray:
    """Read IDX file format.
    
    This is like knowing how to open and read a specific type of container.
    IDX files have a special structure:
    - First 4 bytes: Magic number telling us what's inside
    - Next few bytes: Tell us the shape of our data
    - Rest of the file: The actual data
    """
    with filepath.open('rb') as f:  # Open in binary mode
        # The magic number tells us what kind of file this is
        magic = int.from_bytes(f.read(4), byteorder='big')
        n_dims = magic % 256  # Extract number of dimensions
        
        # Read the size of each dimension
        dims = []
        for _ in range(n_dims):
            dims.append(int.from_bytes(f.read(4), byteorder='big'))
        
        # Read all the data at once and reshape it
        data = np.frombuffer(f.read(), dtype=np.uint8)
        data = data.reshape(dims)
        
        return data

Think of this method like a specialized tool that knows how to:

  1. Open a specific type of package (IDX file)

  2. Read its label (magic number)

  3. Understand its dimensions (shape information)

  4. Extract its contents (data) in the right shape
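
To make the magic-number logic concrete: the MNIST training-images file starts with the 32-bit value 0x00000803 (2051 in decimal), and its low byte encodes the number of dimensions:

magic = 0x00000803      # First four bytes of train-images-idx3-ubyte
n_dims = magic % 256    # Low byte: number of dimensions
print(n_dims)           # 3 -> the file stores a (60000, 28, 28) array of uint8 pixels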

4. The Data Processor: Preparing Our Data

def _load_data(self, images_file: str, labels_file: str) -> tuple[np.ndarray, np.ndarray]:
    """Load and preprocess data files.
    
    This is where we:
    1. Read our raw data files
    2. Format them how Qualia expects
    3. Make sure values are in the right range
    
    It's like taking ingredients and preparing them for cooking:
    - Reading the files is like getting ingredients from containers
    - Reshaping is like cutting them to the right size
    - Normalizing is like measuring out the right amounts
    """
    images = self._read_idx_file(self.__path / images_file)
    labels = self._read_idx_file(self.__path / labels_file)
    
    # Format images to [N, H, W, C] shape and normalize to [0,1]
    # - N: number of images
    # - H: height (28)
    # - W: width (28)
    # - C: channels (1 for grayscale)
    images = images.reshape(-1, 28, 28, 1).astype(np.float32) / 255.0
    
    return images, labels

This method is like your prep cook:

  1. Gets raw ingredients (reads files)

  2. Prepares them in the right format (reshapes arrays)

  3. Measures them correctly (normalizes values)
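
If you want to sanity-check the loader in isolation (assuming the MNIST class above is importable and the four IDX files sit under data/mnist), a quick script like this can catch shape and range mistakes early:

import numpy as np

dataset = MNIST(path='data/mnist')
images, labels = dataset._load_data('train-images-idx3-ubyte', 'train-labels-idx1-ubyte')

assert images.shape == (60000, 28, 28, 1)            # N, H, W, C
assert images.dtype == np.float32
assert 0.0 <= images.min() and images.max() <= 1.0   # Normalized pixel values
assert labels.shape == (60000,)                      # One label per image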

5. The Main Method: Putting It All Together

def __call__(self) -> RawDataModel:
    """Load and prepare the complete dataset.
    
    This is our main kitchen where we:
    1. Load all our data
    2. Organize it into training and test sets
    3. Package it in Qualia's preferred containers
    4. Add helpful information for debugging
    """
    logger.info('Loading MNIST dataset from %s', self.__path)
    
    # Load and prepare training and test data
    train_x, train_y = self._load_data('train-images-idx3-ubyte', 'train-labels-idx1-ubyte')
    test_x, test_y = self._load_data('t10k-images-idx3-ubyte', 't10k-labels-idx1-ubyte')
    
    # Log shapes so we can verify everything looks right
    logger.info('Shapes: train_x=%s, train_y=%s, test_x=%s, test_y=%s',
               train_x.shape, train_y.shape, test_x.shape, test_y.shape)
    
    # Package everything in Qualia's containers
    return RawDataModel(
        sets=RawDataSets(
            train=RawData(train_x, train_y),
            test=RawData(test_x, test_y)
        ),
        name=self.name
    )

This is like the head chef that:

  1. Coordinates all the preparation steps

  2. Ensures quality control (logging)

  3. Plates the final dish (returns RawDataModel)

Using Your Dataset

Now that we’ve built our dataset class, we need to:

  1. Register it in __init__.py:

from .MNIST import MNIST  # Tell Qualia about our new dataset

  2. Create a configuration file (config.toml):

[dataset]
kind = "MNIST"              # Which dataset to use
params.path = "data/mnist"  # Where to find the data
params.variant = "raw"      # Which variant to use

[[preprocessing]]
kind = "Class2BinMatrix"    # Convert number labels to one-hot vectors

The configuration file is like a recipe that tells Qualia:

  • What dataset to use

  • Where to find the data

  • How to process it
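
Given the file names used in __call__ above, the folder referenced by params.path is expected to contain the four raw IDX files (uncompressed, since the reader above works on the raw bytes):

data/mnist/
├── train-images-idx3-ubyte
├── train-labels-idx1-ubyte
├── t10k-images-idx3-ubyte
└── t10k-labels-idx1-ubyte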

Testing Your Dataset

Always test your dataset before using it in training:

  1. Basic loading test:

dataset = MNIST(path="test_data")
data = dataset()

# Verify shapes
print(f"Training data shape: {data.sets.train.x.shape}")
print(f"Training labels shape: {data.sets.train.y.shape}")
  1. Full pipeline test:

qualia ./config.toml preprocess_data

These tests help ensure your dataset will work correctly in the full Qualia pipeline.

Understanding Common Issues

When implementing a dataset, you might encounter several common challenges:

  1. File Reading Issues:

    • Wrong file paths

    • Incorrect file format reading

    • Memory problems with large files

  2. Data Format Issues:

    • Wrong array shapes

    • Incorrect normalization

    • Type mismatches

  3. Memory Issues:

    • Loading too much data at once

    • Not cleaning up temporary arrays

    • Using inefficient data types

Always add proper error handling and logging to help diagnose these issues.
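
As one possible sketch (not the only way to do this), the file reader shown earlier could validate the path and the amount of data it read, so a missing or truncated file fails with a clear message instead of a cryptic reshape error:

def _read_idx_file(self, filepath: Path) -> np.ndarray:
    """Read an IDX file, failing early with a clear message if it is missing or truncated."""
    if not filepath.is_file():
        logger.error('IDX file not found: %s', filepath)
        raise FileNotFoundError(f'Expected IDX file at {filepath}')

    logger.info('Reading IDX file %s', filepath)
    with filepath.open('rb') as f:
        magic = int.from_bytes(f.read(4), byteorder='big')
        n_dims = magic % 256
        dims = [int.from_bytes(f.read(4), byteorder='big') for _ in range(n_dims)]
        data = np.frombuffer(f.read(), dtype=np.uint8)

    expected = int(np.prod(dims))
    if data.size != expected:  # Catch truncated or corrupted files
        raise ValueError(f'{filepath}: expected {expected} values, got {data.size}')
    return data.reshape(dims)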
