medsegpy.data

medsegpy.data.build

Build dataset dictionaries.

medsegpy.data.build.filter_dataset(dataset_dicts: List[Dict], by: collections.abc.Hashable, accepted_elements, include_missing: bool = False)[source]

Filter by common dataset fields.

Parameters:
  • dataset_dicts (List[Dict]) – data in MedSegPy Dataset format.
  • by (Hashable) – Field to filter by.
  • accepted_elements (Sequence) – Acceptable elements.
  • include_missing (bool, optional) – If True, include dataset dicts that are missing the by field.
Returns:

List[Dict] – Filtered dataset dictionaries.
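
A minimal usage sketch (the dataset dicts and field values below are illustrative, not part of the API):

    from medsegpy.data.build import filter_dataset

    # Hypothetical dataset dicts in MedSegPy Dataset format.
    dataset_dicts = [
        {"file_name": "a.h5", "subject_id": 1},
        {"file_name": "b.h5", "subject_id": 2},
        {"file_name": "c.h5"},  # missing "subject_id"
    ]

    # Keep only entries whose "subject_id" is in the accepted set.
    filtered = filter_dataset(dataset_dicts, by="subject_id", accepted_elements={1})

    # With include_missing=True, dicts lacking the field are also kept.
    filtered = filter_dataset(
        dataset_dicts, by="subject_id", accepted_elements={1}, include_missing=True
    )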

medsegpy.data.build.get_sem_seg_dataset_dicts(dataset_names: Sequence[str], filter_empty: bool = True)[source]

Load and prepare dataset dicts for semantic segmentation.

Parameters:
  • dataset_names (Sequence[str]) – A list of dataset names.
  • filter_empty (bool, optional) – If True, filter out dataset dicts without the field ‘sem_seg_file’.
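
For example (the dataset name must already be registered in DatasetCatalog; “oai_2d_train” is used here only as an example):

    from medsegpy.data.build import get_sem_seg_dataset_dicts

    # Load and merge dicts from one or more registered datasets,
    # dropping entries without ground truth segmentation.
    dataset_dicts = get_sem_seg_dataset_dicts(["oai_2d_train"], filter_empty=True)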

medsegpy.data.catalog

Metadata catalogs for different datasets.

Metadata stores information like directory paths, mapping from class ids to name, etc.

Adapted from Facebook’s detectron2. https://github.com/facebookresearch/detectron2

class medsegpy.data.catalog.DatasetCatalog[source]

A catalog that stores information about the datasets and how to obtain them.

It contains a mapping from strings (which are names that identify a dataset, e.g. “oai_2d_train”) to a function which parses the dataset and returns the samples in the format of list[dict].

The returned dicts should be in the MedSegPy Dataset format (see DATASETS.md for details) if used with the data loader functionality in data/build.py and data/detection_transform.py.

The purpose of having this catalog is to make it easy to choose different datasets, by just using the strings in the config.

static register(name, func)[source]
Parameters:
  • name (str) – the name that identifies a dataset, e.g. “coco_2014_train”.
  • func (callable) – a callable which takes no arguments and returns a list of dicts.
static get(name)[source]

Call the registered function and return its results.

Parameters:name (str) – the name that identifies a dataset, e.g. “coco_2014_train”.
Returns:list[dict] – dataset annotations.
static list() → List[str][source]

List all registered datasets.

Returns:list[str]
static clear()[source]

Remove all registered datasets.
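
A minimal end-to-end sketch of the catalog contract above (the dataset name and loader function are hypothetical):

    from medsegpy.data.catalog import DatasetCatalog

    def load_my_dataset():
        # A hypothetical parser returning samples in MedSegPy Dataset format.
        return [{"file_name": "img_001.h5", "sem_seg_file_name": "seg_001.h5"}]

    DatasetCatalog.register("my_dataset_train", load_my_dataset)

    dicts = DatasetCatalog.get("my_dataset_train")  # calls load_my_dataset()
    print(DatasetCatalog.list())                    # ["my_dataset_train", ...]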

class medsegpy.data.catalog.MetadataCatalog[source]

MetadataCatalog provides access to “Metadata” of a given dataset.

The metadata associated with a certain name is a singleton: once created, the metadata will stay alive and will be returned by future calls to get(name).

It’s like global variables, so don’t abuse it. It’s meant for storing knowledge that’s constant and shared across the execution of the program, e.g.: the class names in OAI iMorphics.

static get(name)[source]
Parameters:name (str) – name of a dataset (e.g. oai_2d_train).
Returns:Metadata – The Metadata instance associated with this name, or create an empty one if none is available.
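
A sketch of the singleton behavior described above (the attribute name is illustrative; attribute-style access mirrors the detectron2 Metadata API this catalog is adapted from):

    from medsegpy.data.catalog import MetadataCatalog

    # First access creates an empty Metadata instance for this name.
    meta = MetadataCatalog.get("oai_2d_train")
    meta.category_colors = [(0, 255, 0)]  # illustrative attribute

    # Later calls return the same instance, so the attribute persists.
    assert MetadataCatalog.get("oai_2d_train").category_colors == [(0, 255, 0)]
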
static convert_path_to_dataset()[source]

Convert the dataset path to name for legacy code.

This method will be phased out in future versions.

medsegpy.data.data_loader

medsegpy.data.data_loader.build_data_loader(cfg: medsegpy.config.Config, dataset_dicts: List[Dict], **kwargs) → medsegpy.data.data_loader.DataLoader[source]

Get a data loader based on the config TAG or an explicit name/value pair.

class medsegpy.data.data_loader.DataLoader(cfg: medsegpy.config.Config, dataset_dicts: List[Dict], is_test: bool = False, shuffle: bool = True, drop_last: bool = True, batch_size: int = 1)[source]

Data loader following keras.utils.Sequence API.

Data loaders load data per batch in the following way:
  1. Collate inputs and outputs
  2. Optionally apply preprocessing

To avoid changing the order of the base list, we shuffle a list of indices and query based on the index.

Data loaders in medsegpy also have the ability to yield inference results per scan (see inference()).

__init__(cfg: medsegpy.config.Config, dataset_dicts: List[Dict], is_test: bool = False, shuffle: bool = True, drop_last: bool = True, batch_size: int = 1)[source]
Parameters:
  • cfg (Config) – A config object.
  • dataset_dicts (List[Dict]) – List of data in medsegpy dataset format.
  • is_test (bool, optional) – If True, configures loader as a testing/inference loader. This is typically used when running evaluation.
  • shuffle (bool, optional) – If True, shuffle data every epoch.
  • drop_last (bool, optional) – Drop the last incomplete batch, if the dataset size is not divisible by the batch size. If False and the size of the dataset is not divisible by batch size, then the last batch will be smaller. This can affect loss calculations.
  • batch_size (int, optional) – Batch size.
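
A construction sketch using the concrete DefaultDataLoader documented below (the config object is assumed to be built elsewhere, e.g. from an experiment file):

    from typing import List

    from medsegpy.config import Config
    from medsegpy.data.build import get_sem_seg_dataset_dicts
    from medsegpy.data.data_loader import DefaultDataLoader

    def make_train_loader(cfg: Config, dataset_names: List[str]) -> DefaultDataLoader:
        dataset_dicts = get_sem_seg_dataset_dicts(dataset_names)
        return DefaultDataLoader(
            cfg,
            dataset_dicts,
            shuffle=True,     # reshuffle indices every epoch
            drop_last=False,  # keep the final, smaller batch
            batch_size=16,
        )
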
__len__()[source]

Number of batches.

By default, each element in the dataset dict is independent.

inference(model, **kwargs)[source]

Yields dictionaries of inputs, outputs per scan.

In medical settings, data is often processed per scan, not necessarily per example. This distinction is critical. For example, a 2D segmentation network may take in 2D slices of a scan as input. However, during inference, it is standard to compute metrics on the full scan, not individual slices.

This method does the following:
  1. Loads dataset dicts corresponding to a scan
  2. Structures data from these dicts
  3. Runs predictions on the structured data
  4. Restructures inputs. Images/volumes are restructured to HxWx…
    Segmentation masks and predictions are restructured to HxWx…xC.
  5. Yields input, output dictionaries for the scan. Yielding continues
    until all scans have been processed.

This method should yield scan-specific inputs and outputs as dictionaries. The following keys should be in the input and output dictionaries for each scan at minimum.

Input keys:
  • “scan_id” (str): the scan identifier
  • “x” (ndarray): the raw (unprocessed) input. Shape HxWx…
    If the network takes multiple inputs, each input should correspond to a unique key that will be handled by your specified evaluator.
  • “scan_XXX” (optional): scan-related parameters that will simplify
    evaluation, e.g. “scan_spacing”. MedSegPy evaluators will default to scan-specific information, if provided. For example, if “scan_spacing” is specified, its value will override the default spacing for the dataset.
  • “subject_id” (optional): the subject identifier for the scan.
    Useful for grouping results by subject.
Output keys:
  • “time_elapsed” (required): Amount of time required for inference
    on the scan. This quantity typically includes data loading time as well.
  • “y_true” (ndarray): Ground truth binary mask for semantic
    segmentation. Shape HxWx…xC. Required for semantic segmentation inference.
  • “y_pred” (ndarray): Prediction probabilities for semantic
    segmentation. Shape HxWx…xC. Required for semantic segmentation inference.

All output keys except “time_elapsed” are optional and task specific.

Parameters:
  • model – A model to run inference on.
  • kwargs – Keyword arguments to model.predict_generator()
Yields:

dict, dict – Dictionaries of inputs and outputs corresponding to a single scan.
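
An evaluation-loop sketch built on the keys above (the Dice computation is illustrative, not a MedSegPy API; `model` is a trained Keras model and `loader` a test-mode data loader):

    import numpy as np

    def evaluate_per_scan(model, loader):
        for inputs, outputs in loader.inference(model):
            y_true = outputs["y_true"]          # HxWx...xC binary mask
            y_pred = outputs["y_pred"] >= 0.5   # threshold probabilities
            # Illustrative Dice overlap pooled over all classes.
            overlap = np.sum(y_pred & (y_true > 0))
            dice = 2.0 * overlap / (y_pred.sum() + (y_true > 0).sum() + 1e-8)
            print(inputs["scan_id"], "dice=%.3f" % dice,
                  "time=%.2fs" % outputs["time_elapsed"])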

class medsegpy.data.data_loader.DefaultDataLoader(cfg: medsegpy.config.Config, dataset_dicts: List[Dict], is_test: bool = False, shuffle: bool = True, drop_last: bool = False, batch_size: int = 1)[source]

The default data loader functionality in medsegpy.

This class takes a dataset dict in the MedSegPy 2D Dataset format and maps it to a format that can be used by the model for semantic segmentation.

For each batch, it does the following:

  1. Read the input matrix from “file_name”
  2. Read the ground truth mask matrix from “sem_seg_file_name”
  3. If needed:
    1. Add binary labels for background
  4. Apply MedTransform transforms to input and masks.
  5. If training, return input (preprocessed), output. If testing, return input (preprocessed), output, input (raw). The testing structure is useful for tracking the original input without any preprocessing. This return structure does not conflict with existing Keras model functionality.
__getitem__(idx)[source]
Parameters:idx – Batch index.
Returns:ndarray, ndarray – images NxHxWx(…)x1, masks NxHxWx(…)x1
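
Since the loader follows the keras.utils.Sequence API, batches can be indexed directly; a sketch (assuming `train_loader` was built as above):

    images, masks = train_loader[0]  # one batch
    for idx in range(len(train_loader)):
        images, masks = train_loader[idx]
        assert images.shape[0] == masks.shape[0]
    # Per the docs above, a loader built with is_test=True additionally
    # returns a third element: the raw (unprocessed) input.
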
class medsegpy.data.data_loader.PatchDataLoader(cfg: medsegpy.config.Config, dataset_dicts: List[Dict], is_test: bool = False, shuffle: bool = True, drop_last: bool = False, batch_size: int = 1, use_singlefile: bool = False)[source]

This data loader pre-computes patch locations and padding based on patch size (cfg.IMG_SIZE), pad type (cfg.IMG_PAD_MODE), pad size (cfg.IMG_PAD_SIZE), and stride (cfg.IMG_STRIDE) parameters specified in the config.

Assumptions:
  • all dataset dictionaries have the same image dimensions
  • “image_size” is a key in each dataset dict
__getitem__(idx)[source]
Parameters:idx – Batch index.
Returns:ndarray, ndarray – images NxHxWx(…)x1, masks NxHxWx(…)x1
class medsegpy.data.data_loader.N5dDataLoader(cfg: medsegpy.config.Config, dataset_dicts: List[Dict], is_test: bool = False, shuffle: bool = True, drop_last: bool = False, batch_size: int = 1)[source]

n.5D data loader.

Use this for 2.5D, 3.5D, etc. implementations. Currently, only the last dimension is supported as the channel dimension.

class medsegpy.data.data_loader.S25dDataLoader(cfg: medsegpy.config.Config, dataset_dicts: List[Dict], is_test: bool = False, shuffle: bool = True, drop_last: bool = False, batch_size: int = 1)[source]

Special case of 2.5D data loader compatible with 2D MedSegPy data format.

Each dataset dict should represent a slice and must have the additional keys:
  • “slice_id” (int): Slice id (1-indexed) that the dataset dict corresponds to.
  • “scan_num_slices” (int): Number of total slices in the scan that the dataset dict is derived from.

Padding is automatically applied to ensure all slices are considered.

This is a temporary solution until the slow loading speeds of the N5dDataLoader are properly debugged.
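
For reference, a minimal dataset dict for this loader might look like the following (file names are placeholders; only “slice_id” and “scan_num_slices” are the extra keys required here):

    dataset_dict = {
        "file_name": "scan_0001/slice_005.h5",            # placeholder path
        "sem_seg_file_name": "scan_0001/slice_005_seg.h5",
        "slice_id": 5,           # 1-indexed slice within the scan
        "scan_num_slices": 72,   # total slices in the originating scan
    }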

medsegpy.data.data_utils

medsegpy.data.transforms