Understanding How to Add Datasets to Qualia: A Comprehensive Guide¶
When adding a new dataset to Qualia, we’re creating a bridge between raw data files and Qualia’s machine learning pipeline. Think of it like building a translator that takes your data and makes it speak Qualia’s language. Let’s understand this process deeply, using MNIST as our learning example.
Summary¶
Start by creating a new Python module in the dataset folder of the code base (Qualia-Core or a Qualia-Plugin source folder), called MyNewDataset.py in this example.
Inside this module, create a MyNewDataset class that inherits from RawDataset.
Adapt the __call__ method to load your data and return the appropriate objects described below.
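Putting those three steps together, a minimal skeleton might look like this (a sketch only: MyNewDataset is a placeholder name, the arrays are dummy placeholders, and the real loading logic is covered in the MNIST walkthrough below):
import numpy as np
from qualia_core.datamodel.RawDataModel import RawData, RawDataSets, RawDataModel
from qualia_core.dataset.RawDataset import RawDataset

class MyNewDataset(RawDataset):
    """Skeleton dataset: load data and wrap it in Qualia's containers."""

    def __call__(self) -> RawDataModel:
        # Replace these placeholders with real loading code (see MNIST below)
        train_x = np.zeros((1, 28, 28, 1), dtype=np.float32)
        train_y = np.zeros((1,), dtype=np.uint8)
        test_x, test_y = train_x.copy(), train_y.copy()
        return RawDataModel(
            sets=RawDataSets(train=RawData(train_x, train_y),
                             test=RawData(test_x, test_y)),
            name=self.name)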
First, Let’s Understand What We’re Building¶
Before writing any code, we need to understand what a dataset class does in Qualia. Think of it as a factory that:
Takes raw data files as input
Processes them into a standard format
Delivers them in a way that Qualia can understand and use
Let’s look at each component and understand why it’s needed.
The Building Blocks: Understanding Each Method¶
1. The Dataset Class Structure¶
from __future__ import annotations

import logging
from pathlib import Path

import numpy as np

from qualia_core.datamodel.RawDataModel import RawData, RawDataSets, RawDataModel
from qualia_core.dataset.RawDataset import RawDataset

logger = logging.getLogger(__name__)


class MNIST(RawDataset):
    """MNIST handwritten digits dataset."""
Let’s understand each import and why we need it:
annotations: Enables using class names in type hints before they're defined
logging: For keeping track of what our dataset is doing
Path: Makes file handling consistent across operating systems
numpy: For efficient array operations on our data
RawDataset: The base class that tells Qualia how to interact with our dataset
RawData, RawDataSets, RawDataModel: The containers that Qualia expects
2. The Initialization Method: Setting Up Our Dataset¶
The __init__ method initializes the dataset. Think of this like setting up your workspace before starting work. We need to:
Know where to find our data files
Decide which version of the data we want
Set up our working environment
def __init__(self, path: str = '', variant: str = 'raw') -> None:
    super().__init__()         # Set up the basic RawDataset structure
    self.__path = Path(path)   # Convert string path to a proper Path object
    self.__variant = variant   # Store which variant we want to use
    self.sets.remove('valid')  # Tell Qualia we won't use a validation set
This method is like preparing your kitchen before cooking:
path: Where to find your ingredients (data files)
variant: Which recipe you're following (data variant)
sets.remove('valid'): Removing tools you won't need (validation set)
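Constructing the dataset object is then a one-liner (the path here is purely illustrative):
dataset = MNIST(path='data/mnist', variant='raw')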
3. The File Reader: Getting Raw Data¶
def _read_idx_file(self, filepath: Path) -> np.ndarray:
    """Read IDX file format.

    This is like knowing how to open and read a specific type of container.
    IDX files have a special structure:
    - First 4 bytes: Magic number telling us what's inside
    - Next few bytes: Tell us the shape of our data
    - Rest of the file: The actual data
    """
    with filepath.open('rb') as f:  # Open in binary mode
        # The magic number tells us what kind of file this is
        magic = int.from_bytes(f.read(4), byteorder='big')
        n_dims = magic % 256  # The last byte encodes the number of dimensions

        # Read the size of each dimension (4 bytes each, big-endian)
        dims = []
        for _ in range(n_dims):
            dims.append(int.from_bytes(f.read(4), byteorder='big'))

        # Read all the data at once and reshape it
        data = np.frombuffer(f.read(), dtype=np.uint8)
        data = data.reshape(dims)

        return data
Think of this method like a specialized tool that knows how to:
1. Open a specific type of package (IDX file)
2. Read its label (magic number)
3. Understand its dimensions (shape information)
4. Extract its contents (data) in the right shape
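To check your understanding of the byte layout, here is a hedged round-trip sketch: it writes a tiny IDX file by hand (the write_idx_ubyte helper is hypothetical, not part of Qualia) and reads it back with the method above, assuming the MNIST class defined in this guide:
import struct
from pathlib import Path

import numpy as np

def write_idx_ubyte(filepath: Path, data: np.ndarray) -> None:
    """Write a uint8 array in IDX format: two zero bytes, type code 0x08,
    dimension count, then one big-endian 4-byte size per dimension."""
    with filepath.open('wb') as f:
        f.write(struct.pack('>HBB', 0, 0x08, data.ndim))  # 4-byte magic number
        for dim in data.shape:
            f.write(struct.pack('>I', dim))               # dimension sizes
        f.write(data.tobytes())                           # raw data bytes

original = np.arange(18, dtype=np.uint8).reshape(2, 3, 3)
write_idx_ubyte(Path('tiny-idx3-ubyte'), original)
assert np.array_equal(MNIST()._read_idx_file(Path('tiny-idx3-ubyte')), original)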
4. The Data Processor: Preparing Our Data¶
def _load_data(self, images_file: str, labels_file: str) -> tuple[np.ndarray, np.ndarray]:
    """Load and preprocess data files.

    This is where we:
    1. Read our raw data files
    2. Format them how Qualia expects
    3. Make sure values are in the right range

    It's like taking ingredients and preparing them for cooking:
    - Reading the files is like getting ingredients from containers
    - Reshaping is like cutting them to the right size
    - Normalizing is like measuring out the right amounts
    """
    images = self._read_idx_file(self.__path / images_file)
    labels = self._read_idx_file(self.__path / labels_file)

    # Format images to [N, H, W, C] shape and normalize to [0, 1]
    # - N: number of images
    # - H: height (28)
    # - W: width (28)
    # - C: channels (1 for grayscale)
    images = images.reshape(-1, 28, 28, 1).astype(np.float32) / 255.0

    return images, labels
This method is like your prep cook:
1. Gets raw ingredients (reads files)
2. Prepares them in the right format (reshapes arrays)
3. Measures them correctly (normalizes values)
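A few quick assertions right after loading catch most formatting mistakes early. This sketch assumes the MNIST class above, an illustrative data path, and the standard 60,000-image training files:
images, labels = MNIST(path='data/mnist')._load_data(
    'train-images-idx3-ubyte', 'train-labels-idx1-ubyte')
assert images.shape == (60000, 28, 28, 1)           # [N, H, W, C]
assert images.dtype == np.float32                   # normalized floats
assert 0.0 <= images.min() and images.max() <= 1.0  # values in [0, 1]
assert labels.shape == (60000,)                     # one label per image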
5. The Main Method: Putting It All Together¶
def __call__(self) -> RawDataModel:
    """Load and prepare the complete dataset.

    This is our main kitchen where we:
    1. Load all our data
    2. Organize it into training and test sets
    3. Package it in Qualia's preferred containers
    4. Add helpful information for debugging
    """
    logger.info('Loading MNIST dataset from %s', self.__path)

    # Load and prepare training and test data
    train_x, train_y = self._load_data('train-images-idx3-ubyte', 'train-labels-idx1-ubyte')
    test_x, test_y = self._load_data('t10k-images-idx3-ubyte', 't10k-labels-idx1-ubyte')

    # Log shapes so we can verify everything looks right
    logger.info('Shapes: train_x=%s, train_y=%s, test_x=%s, test_y=%s',
                train_x.shape, train_y.shape, test_x.shape, test_y.shape)

    # Package everything in Qualia's containers
    return RawDataModel(
        sets=RawDataSets(
            train=RawData(train_x, train_y),
            test=RawData(test_x, test_y)
        ),
        name=self.name
    )
This is like the head chef that:
Coordinates all the preparation steps
Ensures quality control (logging)
Plates the final dish (returns RawDataModel)
Using Your Dataset¶
Now that we’ve built our dataset class, we need to:
Register it in __init__.py:
from .MNIST import MNIST # Tell Qualia about our new dataset
Create a configuration file (config.toml):
[dataset]
kind = "MNIST" # Which dataset to use
params.path = "data/mnist" # Where to find the data
params.variant = "raw" # Which variant to use
[[preprocessing]]
kind = "Class2BinMatrix" # Convert number labels to one-hot vectors
The configuration file is like a recipe that tells Qualia:
What dataset to use
Where to find the data
How to process it
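To see what the Class2BinMatrix step does conceptually, here is an equivalent NumPy sketch of one-hot encoding (illustrative only, not Qualia's actual implementation):
import numpy as np

labels = np.array([5, 0, 4])                    # integer class labels
one_hot = np.eye(10, dtype=np.float32)[labels]  # one row per label, 10 classes
print(one_hot[0])  # [0. 0. 0. 0. 0. 1. 0. 0. 0. 0.]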
Testing Your Dataset¶
Always test your dataset before using it in training:
Basic loading test:
dataset = MNIST(path="test_data")
data = dataset()
# Verify shapes
print(f"Training data shape: {data.sets.train.x.shape}")
print(f"Training labels shape: {data.sets.train.y.shape}")
Full pipeline test:
qualia ./config.toml preprocess_data
These tests help ensure your dataset will work correctly in the full Qualia pipeline.
Understanding Common Issues¶
When implementing a dataset, you might encounter several common challenges:
File Reading Issues:
Wrong file paths
Incorrect file format reading
Memory problems with large files
Data Format Issues:
Wrong array shapes
Incorrect normalization
Type mismatches
Memory Issues:
Loading too much data at once
Not cleaning up temporary arrays
Using inefficient data types
Always add proper error handling and logging to help diagnose these issues.
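As one hedged example, a wrapper around _read_idx_file with explicit checks and logging makes such failures easier to diagnose (the _read_idx_file_checked name and its checks are illustrative, not part of Qualia):
def _read_idx_file_checked(self, filepath: Path) -> np.ndarray:
    """Hypothetical wrapper adding diagnostics around _read_idx_file."""
    if not filepath.exists():
        logger.error('Data file not found: %s', filepath)
        raise FileNotFoundError(f'Data file not found: {filepath}')
    data = self._read_idx_file(filepath)
    logger.info('Read %s: shape=%s, dtype=%s', filepath.name, data.shape, data.dtype)
    return data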