medsegpy.evaluation

medsegpy.evaluation.evaluator

Dataset evaluator.

Adapted from Facebook’s detectron2. https://github.com/facebookresearch/detectron2

class medsegpy.evaluation.evaluator.DatasetEvaluator[source]

Base class for a dataset evaluator.

The function inference_on_dataset() runs the model over all samples in the dataset and has a DatasetEvaluator process the inputs/outputs.

This class accumulates information about the inputs/outputs (via process()) and produces evaluation results at the end (via evaluate()).

reset()[source]

Preparation for a new round of evaluation. Should be called before starting a round of evaluation.

process(inputs, outputs)[source]

Process an input/output pair.

Parameters:
  • scan_id – the scan id corresponding to the input/output
  • inputs (List[Dict]) – the inputs that are used to call the model. Can also contain scan-specific fields. These fields should start with “scan_”.
  • outputs (List[Dict]) – list of outputs from the model. Each dict should contain at least the following keys:
      • “y_true”: ground truth results
      • “y_pred”: predicted probabilities
      • “time_elapsed”: amount of time to load data and run the model
evaluate()[source]

Evaluate/summarize the performance, after processing all input/output pairs.

Returns:dict – A new evaluator class can return a dict of arbitrary format as long as the user can process the results. In our train_net.py, we expect the following format:
  • key: the name of the task (e.g., bbox)
  • value: a dict of {metric name: score}, e.g.: {“AP50”: 80}
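
As an illustration, here is a minimal sketch of a custom evaluator following this contract. It assumes the input/output dict keys described under process() (“y_true”, “y_pred”), and the mean-Dice computation is purely illustrative, not part of the library:

import numpy as np

from medsegpy.evaluation.evaluator import DatasetEvaluator


class MeanDiceEvaluator(DatasetEvaluator):
    """Illustrative evaluator that accumulates a mean Dice score."""

    def reset(self):
        self._scores = []

    def process(self, inputs, outputs):
        for output in outputs:
            # Binarize; "y_pred" holds predicted probabilities.
            y_true = output["y_true"] > 0.5
            y_pred = output["y_pred"] > 0.5
            intersection = np.sum(y_true & y_pred)
            dice = 2 * intersection / (np.sum(y_true) + np.sum(y_pred) + 1e-8)
            self._scores.append(dice)

    def evaluate(self):
        # Follow the {task: {metric name: score}} convention described above.
        return {"sem_seg": {"DSC": float(np.mean(self._scores))}}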
medsegpy.evaluation.evaluator.inference_on_dataset(model, data_loader: Union[medsegpy.data.data_loader.DataLoader, medsegpy.data.im_gens.Generator], evaluator: Union[medsegpy.evaluation.evaluator.DatasetEvaluator, typing.Sequence[medsegpy.evaluation.evaluator.DatasetEvaluator]])[source]

Run model on the data_loader and evaluate the metrics with evaluator. The model will be used in eval mode.

Parameters:
  • model (keras.Model) –
  • data_loader – an iterable object with a length. The elements it generates will be the inputs to the model.
  • evaluator (DatasetEvaluator) – the evaluator to run. Use DatasetEvaluators([]) if you only want to benchmark, but don’t want to do any evaluation.
Returns:

The return value of evaluator.evaluate()
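
A minimal usage sketch, assuming model (a compiled keras.Model) and test_loader (a medsegpy DataLoader) have been constructed elsewhere, and reusing the illustrative MeanDiceEvaluator sketched above:

from medsegpy.evaluation.evaluator import inference_on_dataset

# `model` and `test_loader` are assumed to be built elsewhere.
results = inference_on_dataset(model, test_loader, MeanDiceEvaluator())
print(results)  # the dict returned by MeanDiceEvaluator.evaluate()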

medsegpy.evaluation.metrics

Metrics Processor.

A processor keeps track of task-specific metrics for different classes. It should not be used to keep track of non-class-specific metrics, such as runtime.

class medsegpy.evaluation.metrics.Metric(units: str = '')[source]

Interface for new metrics.

A metric should be implemented as a callable with explicitly defined arguments. In other words, metrics should not have **kwargs or *args options in the __call__ method.

While the return type is not explicitly constrained, metrics typically return float value(s). The number of values returned corresponds to the number of categories.

  • Metrics should have a different name() for different functionality.
  • Metrics that can process multiple categories at once should duck type a category_dim argument.

To compute metrics:

metric = Metric()
results = metric(...)
display_name()[source]

Name to use for pretty printing and display purposes.
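
As an illustration, here is a hedged sketch of a custom metric under these conventions. It assumes the interface amounts to name(), the inherited display_name(), and a __call__ with explicitly named arguments; the exact base-class hooks may differ:

import numpy as np

from medsegpy.evaluation.metrics import Metric


class Sensitivity(Metric):
    """Illustrative metric: true-positive rate for binarized masks."""

    def name(self):
        return "Sensitivity"

    def __call__(self, y_pred: np.ndarray, y_true: np.ndarray):
        y_pred, y_true = y_pred.astype(bool), y_true.astype(bool)
        true_positives = np.sum(y_pred & y_true)
        return true_positives / (np.sum(y_true) + 1e-8)

A metric that can score several categories at once would additionally accept a category_dim argument (duck-typed, per the note above) and return one value per category.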

class medsegpy.evaluation.metrics.DSC(units: str = '')[source]

Dice score coefficient.

class medsegpy.evaluation.metrics.VOE(units: str = '')[source]

Volumetric overlap error.

class medsegpy.evaluation.metrics.CV(units: str = '')[source]

Coefficient of variation.

class medsegpy.evaluation.metrics.ASSD(units: str = '')[source]

Average symmetric surface distance.
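
For reference, the overlap-based metrics above correspond to the following standard definitions, sketched here in plain numpy for a pair of binarized masks. This is not the library’s internal implementation; ASSD and CV, which involve surface distances and per-scan statistics, are omitted:

import numpy as np


def dsc(y_pred: np.ndarray, y_true: np.ndarray) -> float:
    """Dice score coefficient: 2|A ∩ B| / (|A| + |B|)."""
    y_pred, y_true = y_pred.astype(bool), y_true.astype(bool)
    return 2 * np.sum(y_pred & y_true) / (y_pred.sum() + y_true.sum() + 1e-8)


def voe(y_pred: np.ndarray, y_true: np.ndarray) -> float:
    """Volumetric overlap error: 1 - |A ∩ B| / |A ∪ B|."""
    y_pred, y_true = y_pred.astype(bool), y_true.astype(bool)
    return 1 - np.sum(y_pred & y_true) / (np.sum(y_pred | y_true) + 1e-8)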

class medsegpy.evaluation.metrics.MetricsManager(class_names: Collection[str], metrics: Sequence[Union[medsegpy.evaluation.metrics.Metric, str]] = None)[source]

A class to manage and compute metrics.

Metrics will be calculated for the categories specified during instantiation. All metrics are assumed to be calculated for those categories.

Metrics are indexed by their string representation as returned by name(). They are also computed in the order they were added.

To compute metrics, use this class as a callable. See __call__ for more details.

class_names

Sequence[str] – Category names (in order).

To calculate metrics:


manager = MetricsManager(
    class_names=("tumor", "no tumor"),
    metrics=(DSC(), VOE()),
)

for scan_id, x, y_pred, y_true in zip(ids, xs, preds, ground_truths):
    # Compute metrics per scan.
    manager(scan_id, x=x, y_pred=y_pred, y_true=y_true)

To get the number of scans that have been processed:


num_scans = len(manager)

metrics()[source]

Returns names of current metrics.

add_metrics(metrics: Sequence[Union[medsegpy.evaluation.metrics.Metric, str]])[source]

Add metrics to compute.

Metrics with the same name() cannot be added.

Parameters:metrics (Metric(s)/str(s)) – Metrics to compute. str values should only be used for built-in metrics.
Raises:ValueError – If metric.name() already exists. Metrics with the same name would duplicate the same computation, which is not supported.
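For example (a sketch that assumes the built-in string names match the metric class names documented above):

from medsegpy.evaluation.metrics import ASSD, DSC, MetricsManager

manager = MetricsManager(class_names=("tumor", "no tumor"), metrics=(DSC(),))

# Built-in metrics can be referenced by name; Metric instances can be mixed in.
manager.add_metrics(["VOE", ASSD()])

# Adding a metric whose name() is already registered raises ValueError:
# manager.add_metrics([DSC()])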
remove_metrics(metrics: Union[str, typing.Sequence[str]])[source]

Remove metrics from the set of metrics to compute.

Parameters:metrics (str(s)) – Names of metrics to remove.
__call__(scan_id: str, x: numpy.ndarray = None, y_pred: numpy.ndarray = None, y_true: numpy.ndarray = None, runtime: float = nan, **kwargs) → str[source]

Compute metrics for a scan.

Parameters:
  • scan_id (str) – The scan/example identifier
  • x (ndarray, optional) – The input x accepted by most metrics.
  • y_pred (ndarray, optional) – The predicted output. For most metrics, should be binarized. If computing for multiple classes, last dimension should index different categories in the order of self.class_names.
  • y_true (ndarray, optional) – The binarized ground truth output. For multiple classes, format like y_pred.
  • runtime (float, optional) – The compute time. If specified, logged as an additional metric.
Returns:

str – A summary of the results for the scan.

scan_summary(scan_id, delimiter: str = ', ') → str[source]

Get summary of results for a scan.

Parameters:
  • scan_id – Scan id for which to summarize results.
  • delimiter (str, optional) – Delimiter between different metrics.
Returns:

str – A summary of metrics for the scan. Values are averaged across all categories.

summary()[source]

Get summary of results over all scans.

Returns:str – Tabulated summary. Rows=metrics. Columns=classes.
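
Continuing the MetricsManager example above (the scan id here is hypothetical):

# Per-scan results, averaged across categories ("scan_001" is a hypothetical id).
print(manager.scan_summary("scan_001", delimiter=", "))

# Tabulated results over all processed scans (rows=metrics, columns=classes).
print(manager.summary())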
data()[source]

TODO: Determine format

medsegpy.evaluation.sem_seg_evaluation

medsegpy.evaluation.sem_seg_evaluation.get_stats_string(manager: medsegpy.evaluation.metrics.MetricsManager)[source]

Returns formatted metrics manager summary string.

Parameters:manager (MetricsManager) – The manager whose results to format.
Returns:str – A formatted string detailing manager results.
class medsegpy.evaluation.sem_seg_evaluation.SemSegEvaluator(dataset_name: str, cfg: medsegpy.config.Config, output_folder: str = None, save_raw_data: bool = False, stream_evaluation: bool = True)[source]

Evaluator for semantic segmentation-related tasks.

__init__(dataset_name: str, cfg: medsegpy.config.Config, output_folder: str = None, save_raw_data: bool = False, stream_evaluation: bool = True)[source]
Parameters:
  • dataset_name (str) – name of the dataset to be evaluated.
  • cfg
  • output_folder (str) – an output directory to dump results.
  • save_raw_data (bool, optional) – If True, save predicted probabilities, labels, and ground truth masks to an h5 file.
  • stream_evaluation (bool, optional) – If True, evaluates data as it comes in to avoid holding too many objects in memory.
process(inputs, outputs)[source]

See DatasetEvaluator in evaluator.py for argument details.

evaluate()[source]

Evaluates popular medical segmentation metrics specified in config.

  • Evaluate on popular medical segmentation metrics. For supported segmentation metrics, see MetricsManager.
  • Save overlay images.
  • Save probability predictions.

Note that, by default, the coefficient of variation (CV) is calculated as a root-mean-squared quantity rather than a mean.
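
A hedged end-to-end sketch, assuming cfg (a medsegpy.config.Config), model, and test_loader have been built elsewhere; the dataset name and output folder are hypothetical:

from medsegpy.evaluation.evaluator import inference_on_dataset
from medsegpy.evaluation.sem_seg_evaluation import SemSegEvaluator

# `cfg`, `model`, and `test_loader` are assumed to be constructed elsewhere.
evaluator = SemSegEvaluator(
    dataset_name="my_test_dataset",  # hypothetical dataset name
    cfg=cfg,
    output_folder="./eval_results",  # hypothetical output directory
    save_raw_data=False,
    stream_evaluation=True,
)
results = inference_on_dataset(model, test_loader, evaluator)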