# Understanding How to Add Datasets to Qualia: A Comprehensive Guide

When adding a new dataset to Qualia, we're creating a bridge between raw data files and Qualia's machine learning pipeline. Think of it like building a translator that takes your data and makes it speak Qualia's language. Let's understand this process deeply, using MNIST as our learning example.

## Summary

Start by creating a new Python module in the `dataset` folder of the code base (Qualia-Core or a Qualia-Plugin source folder), called `MyNewDataset.py` in this example. Inside this module, create a `MyNewDataset` class that inherits from `RawDataset`. Adapt the `__call__` method to load your data and return the appropriate objects described below.

## First, Let's Understand What We're Building

Before writing any code, we need to understand what a dataset class does in Qualia. Think of it as a factory that:

1. Takes raw data files as input
2. Processes them into a standard format
3. Delivers them in a way that Qualia can understand and use

Let's look at each component and understand why it's needed.

## The Building Blocks: Understanding Each Method

### 1. The Dataset Class Structure

```python
from __future__ import annotations

import logging
from pathlib import Path

import numpy as np

from qualia_core.datamodel.RawDataModel import RawData, RawDataSets, RawDataModel
from qualia_core.dataset.RawDataset import RawDataset

logger = logging.getLogger(__name__)

class MNIST(RawDataset):
    """MNIST handwritten digits dataset."""
```

Let's understand each import and why we need it:

- `annotations`: Enables using class names in type hints before they're defined
- `logging`: For keeping track of what our dataset is doing
- `Path`: Makes file handling consistent across operating systems
- `numpy`: For efficient array operations on our data
- `RawDataset`: The base class that tells Qualia how to interact with our dataset
- `RawData`, `RawDataSets`, `RawDataModel`: The containers that Qualia expects

### 2. The Initialization Method: Setting Up Our Dataset

Initialize the dataset. Think of this like setting up your workspace before starting work. We need to:

1. Know where to find our data files
2. Decide which variant of the data we want
3. Set up our working environment

```python
    def __init__(self, path: str = '', variant: str = 'raw') -> None:
        super().__init__()          # Set up the basic RawDataset structure
        self.__path = Path(path)    # Convert string path to a proper Path object
        self.__variant = variant    # Store which variant we want to use
        self.sets.remove('valid')   # Tell Qualia we won't use a validation set
```

This method is like preparing your kitchen before cooking:

- `path`: Where to find your ingredients (data files)
- `variant`: Which recipe you're following (data variant)
- `sets.remove('valid')`: Removing tools you won't need (validation set)

### 3. The File Reader: Getting Raw Data

```python
    def _read_idx_file(self, filepath: Path) -> np.ndarray:
        """Read IDX file format.

        This is like knowing how to open and read a specific type of container.
        IDX files have a special structure:
        - First 4 bytes: Magic number telling us what's inside
        - Next few bytes: Tell us the shape of our data
        - Rest of the file: The actual data
        """
        with filepath.open('rb') as f:  # Open in binary mode
            # The magic number tells us what kind of file this is
            magic = int.from_bytes(f.read(4), byteorder='big')
            n_dims = magic % 256  # Extract number of dimensions

            # Read the size of each dimension
            dims = []
            for _ in range(n_dims):
                dims.append(int.from_bytes(f.read(4), byteorder='big'))

            # Read all the data at once and reshape it
            data = np.frombuffer(f.read(), dtype=np.uint8)
            data = data.reshape(dims)
            return data
```

Think of this method like a specialized tool that knows how to:

1. Open a specific type of package (IDX file)
2. Read its label (magic number)
3. Understand its dimensions (shape information)
4. Extract its contents (data) in the right shape
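Before wiring up the rest of the class, it can be reassuring to poke at the reader interactively. A minimal sanity check might look like this (a sketch, assuming the standard uncompressed MNIST files have already been downloaded into `data/mnist`):

```python
from pathlib import Path

dataset = MNIST(path='data/mnist')

# _read_idx_file takes the full path to a single IDX file
images = dataset._read_idx_file(Path('data/mnist') / 'train-images-idx3-ubyte')
labels = dataset._read_idx_file(Path('data/mnist') / 'train-labels-idx1-ubyte')

print(images.shape)  # Expected: (60000, 28, 28) -- raw images, not yet reshaped
print(labels.shape)  # Expected: (60000,)
print(labels[:3])    # Expected: [5 0 4] -- the first three training labels
```

Note that the raw images come back as `uint8` arrays of shape `(60000, 28, 28)`; the reshape to `[N, H, W, C]` and the normalization happen in the next method.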
### 4. The Data Processor: Preparing Our Data

```python
    def _load_data(self, images_file: str, labels_file: str) -> tuple[np.ndarray, np.ndarray]:
        """Load and preprocess data files.

        This is where we:
        1. Read our raw data files
        2. Format them how Qualia expects
        3. Make sure values are in the right range

        It's like taking ingredients and preparing them for cooking:
        - Reading the files is like getting ingredients from containers
        - Reshaping is like cutting them to the right size
        - Normalizing is like measuring out the right amounts
        """
        images = self._read_idx_file(self.__path / images_file)
        labels = self._read_idx_file(self.__path / labels_file)

        # Format images to [N, H, W, C] shape and normalize to [0, 1]
        # - N: number of images
        # - H: height (28)
        # - W: width (28)
        # - C: channels (1 for grayscale)
        images = images.reshape(-1, 28, 28, 1).astype(np.float32) / 255.0

        return images, labels
```

This method is like your prep cook:

1. Gets raw ingredients (reads files)
2. Prepares them in the right format (reshapes arrays)
3. Measures them correctly (normalizes values)

### 5. The Main Method: Putting It All Together

```python
    def __call__(self) -> RawDataModel:
        """Load and prepare the complete dataset.

        This is our main kitchen where we:
        1. Load all our data
        2. Organize it into training and test sets
        3. Package it in Qualia's preferred containers
        4. Add helpful information for debugging
        """
        logger.info('Loading MNIST dataset from %s', self.__path)

        # Load and prepare training and test data
        train_x, train_y = self._load_data('train-images-idx3-ubyte', 'train-labels-idx1-ubyte')
        test_x, test_y = self._load_data('t10k-images-idx3-ubyte', 't10k-labels-idx1-ubyte')

        # Log shapes so we can verify everything looks right
        logger.info('Shapes: train_x=%s, train_y=%s, test_x=%s, test_y=%s',
                    train_x.shape, train_y.shape, test_x.shape, test_y.shape)

        # Package everything in Qualia's containers
        return RawDataModel(
            sets=RawDataSets(
                train=RawData(train_x, train_y),
                test=RawData(test_x, test_y)
            ),
            name=self.name
        )
```

This is like the head chef that:

1. Coordinates all the preparation steps
2. Ensures quality control (logging)
3. Plates the final dish (returns RawDataModel)

## Using Your Dataset

Now that we've built our dataset class, we need to:

1. Register it in `__init__.py`:

```python
from .MNIST import MNIST  # Tell Qualia about our new dataset
```

2. Create a configuration file (`config.toml`):

```toml
[dataset]
kind = "MNIST"              # Which dataset to use
params.path = "data/mnist"  # Where to find the data
params.variant = "raw"      # Which variant to use

[[preprocessing]]
kind = "Class2BinMatrix"    # Convert number labels to one-hot vectors
```

The configuration file is like a recipe that tells Qualia:

- What dataset to use
- Where to find the data
- How to process it (see the sketch of one-hot encoding below)
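If one-hot encoding is new to you, the idea behind the `Class2BinMatrix` preprocessing step is simple enough to sketch in a few lines of NumPy (a conceptual illustration only, not Qualia's actual implementation):

```python
import numpy as np

labels = np.array([5, 0, 4])  # Integer digit labels, as stored in MNIST
one_hot = np.eye(10, dtype=np.float32)[labels]  # One identity-matrix row per label

print(one_hot[0])  # Label 5 becomes [0. 0. 0. 0. 0. 1. 0. 0. 0. 0.]
```

Each label becomes a vector of length 10 with a single 1 at the position of the class, which is the format most classification losses expect.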
## Testing Your Dataset

Always test your dataset before using it in training:

1. Basic loading test:

```python
dataset = MNIST(path="test_data")
data = dataset()

# Verify shapes
print(f"Training data shape: {data.sets.train.x.shape}")
print(f"Training labels shape: {data.sets.train.y.shape}")
```

2. Full pipeline test:

```bash
qualia ./config.toml preprocess_data
```

These tests help ensure your dataset will work correctly in the full Qualia pipeline.

## Understanding Common Issues

When implementing a dataset, you might encounter several common challenges:

1. File reading issues:
   - Wrong file paths
   - Incorrect file format reading
   - Memory problems with large files

2. Data format issues:
   - Wrong array shapes
   - Incorrect normalization
   - Type mismatches

3. Memory issues:
   - Loading too much data at once
   - Not cleaning up temporary arrays
   - Using inefficient data types

Always add proper error handling and logging to help diagnose these issues; the sketch below shows what that could look like for the file reader.
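As a starting point, here is one way `_read_idx_file` could fail fast with useful messages instead of a cryptic reshape error (a sketch under the assumptions of the IDX reader above; it reuses the module's `logging`, `numpy`, and `Path` imports, and the checks would need adapting for other formats):

```python
    def _read_idx_file(self, filepath: Path) -> np.ndarray:
        """Read an IDX file, with defensive checks for common failure modes."""
        if not filepath.exists():
            # Catch wrong paths early with a message that names the missing file
            raise FileNotFoundError(f'Dataset file not found: {filepath}')

        with filepath.open('rb') as f:
            magic = int.from_bytes(f.read(4), byteorder='big')
            if magic >> 16 != 0:
                # The first two bytes of a valid IDX magic number are always zero
                raise ValueError(f'{filepath} does not look like an IDX file (magic={magic:#010x})')

            n_dims = magic % 256
            dims = [int.from_bytes(f.read(4), byteorder='big') for _ in range(n_dims)]

            data = np.frombuffer(f.read(), dtype=np.uint8)
            if data.size != np.prod(dims):
                # A size mismatch usually means a truncated or corrupted download
                raise ValueError(f'{filepath}: expected {np.prod(dims)} bytes, got {data.size}')

            logger.debug('Read %s: magic=%#010x, dims=%s', filepath, magic, dims)
            return data.reshape(dims)
```

The checks cost almost nothing at load time, and the `logger.debug` line gives you the shape information you need when a downstream reshape or normalization step misbehaves.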