# Understanding How to Add Datasets to Qualia: A Comprehensive Guide

When adding a new dataset to Qualia, we're creating a bridge between raw data files and Qualia's machine learning pipeline. Think of it like building a translator that takes your data and makes it speak Qualia's language. Let's understand this process deeply, using MNIST as our learning example.

## Summary

Start by creating a new Python module in the `dataset` folder of the code base (Qualia-Core or a Qualia-Plugin source folder), called `MyNewDataset.py` in this example. Inside this module, create a `MyNewDataset` class that inherits from `RawDataset`. Adapt the `__call__` method to load your data and return the appropriate objects described below.

## First, Let's Understand What We're Building

Before writing any code, we need to understand what a dataset class does in Qualia. Think of it as a factory that:

1. Takes raw data files as input
2. Processes them into a standard format
3. Delivers them in a way that Qualia can understand and use

Let's look at each component and understand why it's needed.

## The Building Blocks: Understanding Each Method

### 1. The Dataset Class Structure

```python
from __future__ import annotations

import logging
from pathlib import Path

import numpy as np

from qualia_core.datamodel.RawDataModel import RawData, RawDataSets, RawDataModel
from qualia_core.dataset.RawDataset import RawDataset

logger = logging.getLogger(__name__)

class MNIST(RawDataset):
    """MNIST handwritten digits dataset."""
```

Let's understand each import and why we need it:

- `annotations`: Enables using class names in type hints before they're defined
- `logging`: For keeping track of what our dataset is doing
- `Path`: Makes file handling consistent across operating systems
- `numpy`: For efficient array operations on our data
- `RawDataset`: The base class that tells Qualia how to interact with our dataset
- `RawData`, `RawDataSets`, `RawDataModel`: The containers that Qualia expects

### 2. The Initialization Method: Setting Up Our Dataset

Initialize the dataset. Think of this like setting up your workspace before starting work. We need to:

1. Know where to find our data files
2. Decide which variant of the data we want
3. Set up our working environment

```python
    def __init__(self, path: str = '', variant: str = 'raw') -> None:
        super().__init__()          # Set up the basic RawDataset structure
        self.__path = Path(path)    # Convert string path to a proper Path object
        self.__variant = variant    # Store which variant we want to use
        self.sets.remove('valid')   # Tell Qualia we won't use a validation set
```

This method is like preparing your kitchen before cooking:

- `path`: Where to find your ingredients (data files)
- `variant`: Which recipe you're following (data variant)
- `sets.remove('valid')`: Removing tools you won't need (validation set)

### 3. The File Reader: Getting Raw Data

```python
    def _read_idx_file(self, filepath: Path) -> np.ndarray:
        """Read IDX file format.

        This is like knowing how to open and read a specific type of container.
        IDX files have a special structure:
        - First 4 bytes: Magic number telling us what's inside
        - Next few bytes: Tell us the shape of our data
        - Rest of the file: The actual data
        """
        with filepath.open('rb') as f:  # Open in binary mode
            # The magic number tells us what kind of file this is
            magic = int.from_bytes(f.read(4), byteorder='big')
            n_dims = magic % 256  # Extract number of dimensions

            # Read the size of each dimension
            dims = []
            for _ in range(n_dims):
                dims.append(int.from_bytes(f.read(4), byteorder='big'))

            # Read all the data at once and reshape it
            data = np.frombuffer(f.read(), dtype=np.uint8)
            data = data.reshape(dims)
            return data
```

Think of this method like a specialized tool that knows how to:

1. Open a specific type of package (IDX file)
2. Read its label (magic number)
3. Understand its dimensions (shape information)
4. Extract its contents (data) in the right shape
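Before wiring up the rest of the class, it can be reassuring to poke at the reader interactively. A minimal sanity check might look like this (a sketch, assuming the standard uncompressed MNIST files have already been downloaded into `data/mnist`):

```python
from pathlib import Path

dataset = MNIST(path='data/mnist')

# _read_idx_file takes the full path to a single IDX file
images = dataset._read_idx_file(Path('data/mnist') / 'train-images-idx3-ubyte')
labels = dataset._read_idx_file(Path('data/mnist') / 'train-labels-idx1-ubyte')

print(images.shape)  # Expected: (60000, 28, 28) -- raw images, not yet reshaped
print(labels.shape)  # Expected: (60000,)
print(labels[:3])    # Expected: [5 0 4] -- the first three training labels
```

Note that the raw images come back as `uint8` arrays of shape `(60000, 28, 28)`; the reshape to `[N, H, W, C]` and the normalization happen in the next method.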
### 4. The Data Processor: Preparing Our Data

```python
    def _load_data(self, images_file: str, labels_file: str) -> tuple[np.ndarray, np.ndarray]:
        """Load and preprocess data files.

        This is where we:
        1. Read our raw data files
        2. Format them how Qualia expects
        3. Make sure values are in the right range

        It's like taking ingredients and preparing them for cooking:
        - Reading the files is like getting ingredients from containers
        - Reshaping is like cutting them to the right size
        - Normalizing is like measuring out the right amounts
        """
        images = self._read_idx_file(self.__path / images_file)
        labels = self._read_idx_file(self.__path / labels_file)

        # Format images to [N, H, W, C] shape and normalize to [0, 1]
        # - N: number of images
        # - H: height (28)
        # - W: width (28)
        # - C: channels (1 for grayscale)
        images = images.reshape(-1, 28, 28, 1).astype(np.float32) / 255.0

        return images, labels
```

This method is like your prep cook:

1. Gets raw ingredients (reads files)
2. Prepares them in the right format (reshapes arrays)
3. Measures them correctly (normalizes values)

### 5. The Main Method: Putting It All Together

```python
    def __call__(self) -> RawDataModel:
        """Load and prepare the complete dataset.

        This is our main kitchen where we:
        1. Load all our data
        2. Organize it into training and test sets
        3. Package it in Qualia's preferred containers
        4. Add helpful information for debugging
        """
        logger.info('Loading MNIST dataset from %s', self.__path)

        # Load and prepare training and test data
        train_x, train_y = self._load_data('train-images-idx3-ubyte', 'train-labels-idx1-ubyte')
        test_x, test_y = self._load_data('t10k-images-idx3-ubyte', 't10k-labels-idx1-ubyte')

        # Log shapes so we can verify everything looks right
        logger.info('Shapes: train_x=%s, train_y=%s, test_x=%s, test_y=%s',
                    train_x.shape, train_y.shape, test_x.shape, test_y.shape)

        # Package everything in Qualia's containers
        return RawDataModel(
            sets=RawDataSets(
                train=RawData(train_x, train_y),
                test=RawData(test_x, test_y)
            ),
            name=self.name
        )
```

This is like the head chef that:

1. Coordinates all the preparation steps
2. Ensures quality control (logging)
3. Plates the final dish (returns RawDataModel)

## Using Your Dataset

Now that we've built our dataset class, we need to:

1. Register it in `__init__.py`:

```python
from .MNIST import MNIST  # Tell Qualia about our new dataset
```

2. Create a configuration file (`config.toml`):

```toml
[dataset]
kind = "MNIST"              # Which dataset to use
params.path = "data/mnist"  # Where to find the data
params.variant = "raw"      # Which variant to use

[[preprocessing]]
kind = "Class2BinMatrix"    # Convert number labels to one-hot vectors
```

The configuration file is like a recipe that tells Qualia:

- What dataset to use
- Where to find the data
- How to process it (see the sketch of one-hot encoding below)
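If one-hot encoding is new to you, the idea behind the `Class2BinMatrix` preprocessing step is simple enough to sketch in a few lines of NumPy (a conceptual illustration only, not Qualia's actual implementation):

```python
import numpy as np

labels = np.array([5, 0, 4])  # Integer digit labels, as stored in MNIST
one_hot = np.eye(10, dtype=np.float32)[labels]  # One identity-matrix row per label

print(one_hot[0])  # Label 5 becomes [0. 0. 0. 0. 0. 1. 0. 0. 0. 0.]
```

Each label becomes a vector of length 10 with a single 1 at the position of the class, which is the format most classification losses expect.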
## Testing Your Dataset

Always test your dataset before using it in training:

1. Basic loading test:

```python
dataset = MNIST(path="test_data")
data = dataset()

# Verify shapes
print(f"Training data shape: {data.sets.train.x.shape}")
print(f"Training labels shape: {data.sets.train.y.shape}")
```

2. Full pipeline test:

```bash
qualia ./config.toml preprocess_data
```

These tests help ensure your dataset will work correctly in the full Qualia pipeline.

## Understanding Common Issues

When implementing a dataset, you might encounter several common challenges:

1. File reading issues:
   - Wrong file paths
   - Incorrect file format reading
   - Memory problems with large files

2. Data format issues:
   - Wrong array shapes
   - Incorrect normalization
   - Type mismatches

3. Memory issues:
   - Loading too much data at once
   - Not cleaning up temporary arrays
   - Using inefficient data types

Always add proper error handling and logging to help diagnose these issues; the sketch below shows what that could look like for the file reader.
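As a starting point, here is one way `_read_idx_file` could fail fast with useful messages instead of a cryptic reshape error (a sketch under the assumptions of the IDX reader above; it reuses the module's `logging`, `numpy`, and `Path` imports, and the checks would need adapting for other formats):

```python
    def _read_idx_file(self, filepath: Path) -> np.ndarray:
        """Read an IDX file, with defensive checks for common failure modes."""
        if not filepath.exists():
            # Catch wrong paths early with a message that names the missing file
            raise FileNotFoundError(f'Dataset file not found: {filepath}')

        with filepath.open('rb') as f:
            magic = int.from_bytes(f.read(4), byteorder='big')
            if magic >> 16 != 0:
                # The first two bytes of a valid IDX magic number are always zero
                raise ValueError(f'{filepath} does not look like an IDX file (magic={magic:#010x})')

            n_dims = magic % 256
            dims = [int.from_bytes(f.read(4), byteorder='big') for _ in range(n_dims)]

            data = np.frombuffer(f.read(), dtype=np.uint8)
            if data.size != np.prod(dims):
                # A size mismatch usually means a truncated or corrupted download
                raise ValueError(f'{filepath}: expected {np.prod(dims)} bytes, got {data.size}')

            logger.debug('Read %s: magic=%#010x, dims=%s', filepath, magic, dims)
            return data.reshape(dims)
```

The checks cost almost nothing at load time, and the `logger.debug` line gives you the shape information you need when a downstream reshape or normalization step misbehaves.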