failure_detection_benchmark

Code for the paper "Failure Detection in Medical Image Classification: A Reality Check and Benchmarking Testbed", TMLR 2022, Bernhardt et al.


Failure Detection in Medical Image Classification: A Reality Check and Benchmarking Testbed

[Figure: Failure Detection Framework overview]

This repository contains the code to reproduce all the experiments in the "Failure Detection in Medical Image Classification: A Reality Check and Benchmarking Testbed" paper. In particular, it allows benchmarking 9 different confidence scoring methods on 6 different medical imaging datasets. We provide all the necessary training, evaluation and plotting code.

If you like this repository, please consider citing the associated paper:

@article{
bernhardt2022failure,
title={Failure Detection in Medical Image Classification: A Reality Check and Benchmarking Testbed},
author={M{\'e}lanie Bernhardt and Fabio De Sousa Ribeiro and Ben Glocker},
journal={Transactions on Machine Learning Research},
year={2022},
url={https://openreview.net/forum?id=VBHuLfnOMf}
}

Overview

The repository is divided into 6 main folders:

  • data_handling: contains all the code related to data loading and augmentations.
  • configs: contains all the .yml configs for the experiments in the paper, as well as helper functions to load modules and data from a given config.
  • models: contains all the model architecture definitions and PyTorch Lightning modules.
  • classification: contains the main training script train.py, as well as all the files required to run SWAG in the classification/swag subfolder (code taken from the authors) and the DUQ baseline in the classification/duq subfolder (code adapted from the authors).
  • failure_detection: contains all the code for the various confidence scores.
    • The implementation of each individual score is grouped in the uncertainty_scores folder. Note that the Laplace baseline is obtained with the Laplace package (see the sketch after this list), and for TrustScore we used the code from the authors. Similarly, the ConfidNet sub-subfolder contains a PyTorch Lightning version of the code, adapted from the authors' implementation.
    • The main runner for evaluation is run_evaluation.py, for which you need to provide the name of the config defining the model.
  • paper_plotting_notebooks: contains the notebooks to produce the plots summarizing the experiment results as in the paper.
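Since the Laplace baseline relies on the external Laplace package (laplace-torch), here is a minimal sketch of how that package is typically used to obtain predictive probabilities from a trained classifier. It illustrates the package's general API only; the exact options (weight subset, Hessian structure) used in this repository may differ.

```python
# Minimal sketch of the Laplace package (laplace-torch) API.
# `model` is a trained torch.nn.Module classifier and `train_loader` a DataLoader;
# the specific settings used in this repository may differ.
from laplace import Laplace

la = Laplace(model, likelihood="classification",
             subset_of_weights="last_layer", hessian_structure="kron")
la.fit(train_loader)   # fit the post-hoc Laplace approximation on the training data
probs = la(x)          # approximate predictive probabilities for an input batch x
```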

All outputs will be placed in {REPO_ROOT}/outputs by default.

Prerequisites

  1. Start by creating the conda environment specified by the environment.yml file at the root of the repository (see the example commands after this list).
  2. Make sure you update the paths for the RSNA, EyePACS and BUSI datasets in default_paths.py so that they point to the root folder of each dataset, as downloaded from the Kaggle RSNA Pneumonia Detection Challenge for RSNA, the Kaggle EyePACS Challenge for EyePACS, and this page for BUSI. For the EyePACS dataset you will also need to download the test labels and place them at the root of the dataset folder.
  3. Make sure the root directory of the repository is in your PYTHONPATH environment variable. Ready to go!
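For instance, assuming a standard conda setup and a bash shell, steps 1 and 3 boil down to something like:

```bash
# Run from the root of the repository.
conda env create -f environment.yml       # create the environment defined in environment.yml
conda activate <env-name>                 # replace <env-name> with the name set in environment.yml
export PYTHONPATH="${PYTHONPATH}:$(pwd)"  # make the repository root importable
```

The dataset paths in default_paths.py still need to be updated manually as described in step 2.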

Step-by-step example

In this section, we walk you through all the steps necessary to reproduce the experiments for OrganAMNIST and obtain the boxplots from the paper. The procedure is identical for all other experiments; you only need to change which .yml config file you use (see the next section for a description of the configs).

Assuming your current working directory is the root of the repository:

  1. Train the 5 classification models with python classification/train.py --config medmnist/organamnist_resnet_dropout_all_layers.yml. Note that if you only want to train one model, you can specify a single seed in the config instead of 5 seeds. (All commands are collected in the sketch after these steps.)
  2. If you want the SWAG results, additionally train 5 SWAG models with python classification/swag/train_swag.py --config medmnist/organamnist_resnet_dropout_all_layers.yml.
  3. If you want the ConfidNet results, also run python failure_detection/uncertainty_scorer/confidNet/train.py --config medmnist/organamnist_resnet_dropout_all_layers.yml.
  4. If you want the DUQ results, also run python classification/duq/train.py --config medmnist/organamnist_resnet_dropout_all_layers.yml.
  5. You are then ready to run the evaluation benchmark with python failure_detection/run_evaluation.py --config medmnist/organamnist_resnet_dropout_all_layers.yml.
  6. The outputs can be found in the outputs/{DATASET_NAME}/{MODEL_NAME}/{RUN_NAME} folder. Individual results for each seed are placed in separate subfolders named after the corresponding training seed. Most importantly, in the failure_detection subfolder of this output folder you will find:
    • aggregated.csv: the aggregated tables with averages and standard deviations over seeds.
    • ensemble_metrics.csv: the metrics for the ensemble model.
  7. If you then want to reproduce the boxplot figures from the paper, you can run the paper_plotting_notebooks/aggregated_curves.ipynb notebook. Note that this saves the plots inside the outputs/figures folder.
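Putting it all together, the full command sequence for OrganAMNIST looks as follows (the SWAG, ConfidNet and DUQ commands are only needed if you want those results):

```bash
# Run from the repository root.
python classification/train.py --config medmnist/organamnist_resnet_dropout_all_layers.yml
python classification/swag/train_swag.py --config medmnist/organamnist_resnet_dropout_all_layers.yml          # optional: SWAG
python failure_detection/uncertainty_scorer/confidNet/train.py --config medmnist/organamnist_resnet_dropout_all_layers.yml  # optional: ConfidNet
python classification/duq/train.py --config medmnist/organamnist_resnet_dropout_all_layers.yml                # optional: DUQ
python failure_detection/run_evaluation.py --config medmnist/organamnist_resnet_dropout_all_layers.yml
# Results are written under outputs/{DATASET_NAME}/{MODEL_NAME}/{RUN_NAME}/
```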

Configurations to reproduce all plots in the paper

In order to reproduce all plots in Figure 2 of the paper, you need to repeat all steps 1-7 for the following provided configs:

Just replace the config name with one of the above config names and follow the steps from the previous section (note that the config name argument is expected to be a path relative to the configs folder).
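For example, using the OrganAMNIST config from the step-by-step section, the argument is given relative to configs/:

```bash
# The --config argument is a path relative to the configs/ folder:
python failure_detection/run_evaluation.py --config medmnist/organamnist_resnet_dropout_all_layers.yml
```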

References

For this work, the following repositories were useful: