# Use Custom Datasets

If you want to use a custom dataset while also reusing medsegpy's data loaders,
you will need to

1. Perform data augmentation (optional).
2. Store data in a medsegpy-friendly way.
3. Register metadata for your dataset (i.e., tell medsegpy how to obtain your dataset).

### Data Augmentation
Currently, medsegpy dataloaders do not perform data augmentation by default. Augmentation can optionally be done outside of medsegpy. If you use augmentations, assign each augmentation (or series of augmentations) a unique numeric identifier. This identifier is the augmentation number for the corresponding scans (see below).

Note that augmentation should only be done on the training data. If augmentations are done on the validation and testing data, medsegpy functionality cannot be guaranteed.
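
A minimal sketch of one way to organize this, assuming NumPy arrays and two hypothetical flip augmentations. The `AUGMENTATIONS` table and the `augment` helper are illustrative names, not part of medsegpy. Identifier `0` is reserved for unaugmented data so that it lines up with the `Aug00` naming convention.

```python
import numpy as np

# Hypothetical augmentation table: each unique numeric identifier maps to
# one transform. Identifier 0 is reserved for "no augmentation".
AUGMENTATIONS = {
    1: lambda x: np.flip(x, axis=1),  # flip along the width axis
    2: lambda x: np.flip(x, axis=0),  # flip along the height axis
}


def augment(image, mask, aug_id):
    """Apply the same spatial augmentation to an image and its mask."""
    if aug_id == 0:
        return image, mask
    transform = AUGMENTATIONS[aug_id]
    return transform(image), transform(mask)
```

Applying the same transform to the image and the mask keeps the segmentation aligned with the augmented image.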

### Data format
Data can be stored as 2D slices or 3D volumes in the hdf5 format. All specifications detailed
below are for compatibility with existing DataLoaders in MedSegPy. If you are designing a
custom dataloader, you may be able to deviate from these specifications.

#### 2D Data
Many medical imaging modalities acquire single slices (CT, X-ray, etc.).
Additionally, 3D volumes are often split into 2D slices when training 2D networks to
increase data loading speed.

Data stored in the 2D format must follow a specific naming convention:
  * Subject id: 7 digits
  * Timepoint: 2 digits
  * Augmentation Number: 2 digits. Should be `00` for volumes that are not augmented
  * Slice number: 3 digits (1-indexed)

**Readable format:** `SubjectID_Timepoint-AugmentationNumber_SliceNumber`

**String format:** `%07d_V%02d-Aug%02d_%03d`

**Regex:** `([\d]+)_V([\d]+)-Aug([\d]+)_([\d]+)`

**Examples:**
- `0123456_V01-Aug00_001`: Subject 0123456, Timepoint 1, No Augmentation, Slice 1
- `0123456_V00-Aug00_001`: Subject 0123456, Timepoint 0, No Augmentation, Slice 1
- `0123456_V00-Aug04_001`: Subject 0123456, Timepoint 0, Augmentation 4, Slice 1
- `0123456_V00-Aug00_999`: Subject 0123456, Timepoint 0, No Augmentation, Slice 999

**Compatible DataLoaders**:
- [`DefaultDataLoader`](../modules/engine.html#medsegpy.data.data_loader.DefaultDataLoader)
- [`S25dDataLoader`](../modules/engine.html#medsegpy.data.data_loader.S25dDataLoader)

The augmentation number is used to keep track of what augmentations are done. 
When naming files, note that slices should start at slice 1.
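
The naming convention can be exercised with the string format and regex given above. The sketch below uses only the Python standard library; the helper names `format_slice_name` and `parse_slice_name` are hypothetical and not part of medsegpy.

```python
import re

FNAME_FORMAT = "%07d_V%02d-Aug%02d_%03d"
FNAME_REGEX = re.compile(r"([\d]+)_V([\d]+)-Aug([\d]+)_([\d]+)")


def format_slice_name(subject_id, timepoint, augmentation, slice_num):
    """Build a file name (without extension) for one 2D slice."""
    return FNAME_FORMAT % (subject_id, timepoint, augmentation, slice_num)


def parse_slice_name(name):
    """Split a slice name into its four integer components."""
    match = FNAME_REGEX.fullmatch(name)
    if match is None:
        raise ValueError("Name does not follow the convention: %r" % name)
    subject_id, timepoint, augmentation, slice_num = (
        int(group) for group in match.groups()
    )
    return {
        "subject_id": subject_id,
        "timepoint": timepoint,
        "augmentation": augmentation,
        "slice": slice_num,
    }


print(format_slice_name(123456, 0, 4, 1))  # 0123456_V00-Aug04_001
```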

###### Image files
Image files should end with a `.im` extension. The file should contain a dataset
`data`, which contains a `HxWx1` shaped array corresponding to the slice.

For example, `0123456_V00-Aug00_999.im` contains slice 999 from the volume
`0123456_V00-Aug00`.

###### Segmentation files
Ground truth masks should end with a `.seg` extension. The file should contain a
dataset `data`, which contains a `HxWx1xC` shaped binary array corresponding to the segmentation for the slice. Here, `C` refers to masks for different classes.

For example, `0123456_V00-Aug00_999.seg` contains segmentations for slice 999 from the volume `0123456_V00-Aug00`.
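
As a rough sketch of how such files could be produced, the helper below writes one slice with `h5py`. The function name `write_slice` and the choice of `uint8` for the binary mask are assumptions made for illustration; only the `.im`/`.seg` extensions, the `data` key, and the array shapes come from the specification above.

```python
import os

import h5py
import numpy as np


def write_slice(out_dir, scan_id, slice_num, image, mask):
    """Write one 2D slice as `<scan_id>_<slice>.im` and `<scan_id>_<slice>.seg`.

    image: HxWx1 array. mask: HxWx1xC binary array.
    """
    if image.ndim != 3 or image.shape[-1] != 1:
        raise ValueError("image must have shape HxWx1")
    if mask.ndim != 4 or mask.shape[:3] != image.shape:
        raise ValueError("mask must have shape HxWx1xC")

    stem = os.path.join(out_dir, "%s_%03d" % (scan_id, slice_num))
    with h5py.File(stem + ".im", "w") as f:
        f.create_dataset("data", data=image)
    with h5py.File(stem + ".seg", "w") as f:
        # Storing the mask as uint8 is an assumption, not a medsegpy requirement.
        f.create_dataset("data", data=mask.astype(np.uint8))
    return stem + ".im", stem + ".seg"
```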

#### 3D Data
Unlike the 2D files, 3D files do not have a particular naming convention. Additionally, 3D files will have both image and segmentation data in a single hdf5 file under different keys: `volume` for image data and `seg` for segmentation masks.
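
A 3D file could be written along the following lines. This is a sketch, not a medsegpy utility: `write_volume` is a hypothetical helper, the `uint8` mask dtype is an assumption, and the `HxWxD` / `HxWxDxC` shapes are taken from the reading example in this guide.

```python
import h5py
import numpy as np


def write_volume(path, volume, mask):
    """Store an HxWxD volume and its HxWxDxC mask in a single HDF5 file."""
    if volume.ndim != 3:
        raise ValueError("volume must have shape HxWxD")
    if mask.ndim != 4 or mask.shape[:3] != volume.shape:
        raise ValueError("mask must have shape HxWxDxC")

    with h5py.File(path, "w") as f:
        f.create_dataset("volume", data=volume)
        # Storing the mask as uint8 is an assumption, not a medsegpy requirement.
        f.create_dataset("seg", data=mask.astype(np.uint8))
```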

#### Collating segmentations
Segmentations can also be collated (combined) to form segmentations for
superclasses. For example, if segmentations for "dog" and "cat" were stored
at index `0` and `2` in the segmentation file, specify the tuple `(0, 2)` as
the index to segment both classes as a single class.
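
Conceptually, collating takes the union of the selected class channels. The NumPy sketch below illustrates the idea on an `HxWx1xC` mask. It is not medsegpy's internal implementation, and `collate_classes` is a hypothetical name.

```python
import numpy as np


def collate_classes(mask, category_ids):
    """Select or merge class channels of a binary mask with classes on the last axis.

    Each entry of `category_ids` is either an int (keep that channel) or a
    tuple of ints (merge those channels with a logical OR).
    """
    channels = []
    for cid in category_ids:
        if isinstance(cid, (tuple, list)):
            merged = np.any(mask[..., list(cid)], axis=-1)
        else:
            merged = mask[..., cid].astype(bool)
        channels.append(merged)
    return np.stack(channels, axis=-1).astype(mask.dtype)
```

For instance, `collate_classes(mask, [(0, 2), 1])` would return a two-channel mask whose first channel marks every pixel labeled "dog" or "cat".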

#### How h5 files are read
Below are examples detailing the hdf5 structure for 2D and 3D data.

```python
import h5py

# ========= 2D Data =========
# Read slice 999 for volume 0123456_V00-Aug00.

# Read image slice.
with h5py.File("0123456_V00-Aug00_999.im", "r") as f:
    image = f["data"][:]  # shape: HxWx1

# Read segmentations.
with h5py.File("0123456_V00-Aug00_999.seg", "r") as f:
    mask = f["data"][:]  # shape: HxWx1xC

# ========= 3D Data =========
# Read the image volume and its segmentation from a single file.
with h5py.File("0123456_V00.h5", "r") as f:
    volume = f["volume"][:]  # shape: HxWxD
    mask = f["seg"][:]  # shape: HxWxDxC
```

#### Data paths
Data is often split into training, validation, and testing data. Each split
should be in a different directory. Image and segmentation files should be stored in the appropriate folder.

### Register Dataset
To let medsegpy know how to obtain a dataset named "my_dataset", you will implement a function that
returns the items in your dataset and then tell medsegpy about this function:

```python
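from medsegpy.data import DatasetCatalog


def get_dicts():
    """Return the dataset as a list[dict] in the format described below."""
    dataset_dicts = []
    ...  # Populate dataset_dicts with one dict per item.
    return dataset_dicts


DatasetCatalog.register("my_dataset", get_dicts)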

Here, the snippet associates "my_dataset" with a function that returns the data. The registration
is effective as long as the process is running.

The function can process data from its original format into either one of the following:

1. MedSegPy's standard dataset dict, described below. This will work with many builtin features
in MedSegPy, so it is recommended when it is sufficient for your task.
2. Your custom dataset dict. You can choose to return arbitrary dicts that are designed to work
with your custom dataloader.

#### Standard Dataset Dicts
For standard semantic segmentation tasks, we load the original dataset into `list[dict]`.
Each dictionary is required to contain a set of keys. Because all data must currently be stored
as 2D slices in the h5 format (as described above), the following keys are required for all
dictionaries:

+ `file_name` (str): the full path to the image file for this slice.
+ `sem_seg_file` (str): the full path to the semantic segmentation file for this slice.
+ `scan_id` (str): the scan this slice belongs to
+ `slice_id` (int): The slice this file corresponds to. Should be 1-indexed, meaning the
first slice of every volume has `slice_id=1`.
+ `scan_num_slices` (int): the total number of slices in the scan volume.

The following keys are optional:
+ `subject_id` (int): the subject id corresponding to this scan
+ `time_point` (int): the time point at which the scan was acquired

All keys beginning with `scan_` will be interpreted as special keys unique to the scan.
These will be returned as part of the input dictionary during inference.
In built-in medsegpy functions, these keys will also serve as override keys for any default
metadata associated with the dataset (see metadata section below). For example, if the dataset has a metadata key
`spacing`, the value for `spacing` is typically used for all elements in the dataset.
However, if a dataset dictionary has the key `scan_spacing`, the value of `scan_spacing` will
override the default metadata value.
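
The override rule can be pictured with plain dictionaries. The snippet below only illustrates the lookup order described above and is not medsegpy code; `resolve_metadata` and the spacing values are hypothetical.

```python
def resolve_metadata(key, dataset_dict, metadata):
    """Return the per-scan value `scan_<key>` if present, else the dataset default."""
    return dataset_dict.get("scan_" + key, metadata.get(key))


# Dataset-wide default metadata (illustrative values).
metadata = {"spacing": (0.5, 0.5, 1.0)}

# One scan relies on the default, the other overrides it.
default_scan = {"scan_id": "0123456_V00-Aug00"}
custom_scan = {"scan_id": "0123457_V00-Aug00", "scan_spacing": (0.3, 0.3, 1.5)}

print(resolve_metadata("spacing", default_scan, metadata))  # (0.5, 0.5, 1.0)
print(resolve_metadata("spacing", custom_scan, metadata))  # (0.3, 0.3, 1.5)
```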

MedSegPy will be expanding to support data stored in 3D soon. We will update
this section once that is complete.

#### Examples
For examples of registering datasets, see
[datasets/oai.py](../modules/engine.html#medsegpy.data.datasets.oai)

### Register Metadata for a Dataset

In addition to telling medsegpy how to obtain a dataset named "my_dataset", you will need
to add metadata for your specific dataset. Metadata names and types are shown
below.

Required:
+ `scan_root` (`str`): the directory path where images/segmentation files for the dataset are stored.
+ `category_ids` (sequence of `int` or `tuple[int]`): Category ids corresponding to different classes. Supports segmentation collating.
+ `categories` (sequence of `str`): Sequence of category names. 1-to-1 with `category_ids`.
+ `category_id_to_contiguous_id` (dict of `int/tuple[int]`->`int`): Maps
category ids to contiguous ids (0-indexed).
+ `evaluator_type` (`str`): value should be `"SemSegEvaluator"`

Optional:
+ `spacing` (tuple of `float`): the spacing in millimeters for scan volumes `(dH, dW, ...)`. Required for some segmentation metrics.
+ `category_abbreviations` (sequence of `str`): Abbreviations for categories.
1-to-1 with `categories`.
+ `category_colors` (sequence of `(R,G,B)`): R,G,B colors for different categories.

For data that is split into train/val/test splits, each split should be registered as a different dataset.

Below is an example for registering with training, validation, and testing splits. Segmentations for this data have 4 classes (in order): dog, human, cat, tree. We also want to collate the `dog` and `cat` categories into a new category `pet`.

```python
from medsegpy.data import MetadataCatalog
category_info = [
    {"id": 0, "name": "dog"},
    {"id": 1, "name": "human"},
    {"id": 2, "name": "cat"},
    {"id": 3, "name": "tree"},
    {"id": (0,2), "name": "pet"},  # collate "dog" & "cat" into "pet"
]
datasets_to_path = {
    "my_dataset_train": "/path/to/train/split",
    "my_dataset_val": "/path/to/val/split",
    "my_dataset_test": "/path/to/test/split",
}

for dataset_name, scan_root in datasets_to_path.items():
    MetadataCatalog.get(dataset_name).set(
        scan_root=scan_root,
        category_ids=[x["id"] for x in category_info],
        categories=[x["name"] for x in category_info],
        category_id_to_contiguous_id={
            x["id"]: idx for idx, x in enumerate(category_info)
        },
        evaluator_type="SemSegEvaluator",
    )
```
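
To make the mapping concrete, the short standalone snippet below evaluates the `category_id_to_contiguous_id` comprehension for the five categories in this example. Each category, including the collated `pet` tuple, receives the index of its position in the list.

```python
category_ids = [0, 1, 2, 3, (0, 2)]
categories = ["dog", "human", "cat", "tree", "pet"]

# Each category id (int or tuple) maps to its 0-indexed position.
category_id_to_contiguous_id = {cid: idx for idx, cid in enumerate(category_ids)}
print(category_id_to_contiguous_id)
# {0: 0, 1: 1, 2: 2, 3: 3, (0, 2): 4}

# The contiguous id also indexes into the category names.
print(categories[category_id_to_contiguous_id[(0, 2)]])  # pet
```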

### Update the Config for New Datasets

Once you've registered the dataset, you can use its name (e.g., "my_dataset" in the
example above) in the `{TRAIN,VAL,TEST}_DATASET` config fields.