IML-DKFZ/medical-failure-detection-benchmark

This repository was created on the basis of https://github.com/IML-DKFZ/fd-shifts.

Reliable application of machine learning-based decision systems in the wild is one of the major challenges currently investigated by the field. A large portion of established approaches aims to detect erroneous predictions by assigning confidence scores. This confidence may be obtained by quantifying the model's predictive uncertainty, learning explicit scoring functions, or assessing whether the input is in line with the training distribution. Curiously, while these approaches all claim to address the same eventual goal of detecting failures of a classifier upon real-life application, they currently constitute largely separated research fields with individual evaluation protocols, which either exclude a substantial part of relevant methods or ignore large parts of relevant failure sources. In this work, we systematically reveal current pitfalls caused by these inconsistencies and derive requirements for a holistic and realistic evaluation of failure detection. To demonstrate the relevance of this unified perspective, we present a large-scale empirical study that, for the first time, enables benchmarking confidence scoring functions w.r.t. all relevant methods and failure sources. The revelation of a simple softmax response baseline as the overall best performing method underlines the drastic shortcomings of current evaluation in the abundance of publicized research on confidence scoring.

Holistic perspective on failure detection. Detecting failures should be seen in the context of the overarching goal of preventing silent failures of a classifier, which includes two tasks: preventing failures in the first place, as measured by the "robustness" of a classifier (Task 1), and detecting the non-prevented failures by means of CSFs (Task 2, focus of this work). For failure prevention across distribution shifts, a consistent task formulation exists (featuring accuracy as the primary evaluation metric) and various benchmarks have been released covering a large variety of realistic shifts (e.g. image corruption shifts, sub-class shifts, or domain shifts). In contrast, progress in the subsequent task of detecting the non-prevented failures by means of CSFs is currently obstructed by three pitfalls: 1) A diverse and inconsistent set of evaluation protocols for CSFs exists (MisD, SC, PUQ, OoD-D), impeding comprehensive competition. 2) Only a fraction of the spectrum of realistic distribution shifts, and thus of potential failure sources, is covered, diminishing the practical relevance of evaluation. 3) The task formulation in OoD-D fundamentally deviates from the stated purpose of detecting classification failures. Overall, the holistic perspective on failure detection reveals an obvious need for a unified and comprehensive evaluation protocol, in analogy to current robustness benchmarks, to make classifiers fit for safety-critical applications. Abbreviations: CSF: Confidence Scoring Function, OoD-D: Out-of-Distribution Detection, MisD: Misclassification Detection, PUQ: Predictive Uncertainty Quantification, SC: Selective Classification.

Citing This Work

If you use fd-shifts, please cite our paper:

@inproceedings{Jaeger2022ACT,
  doi = {10.48550/ARXIV.2211.15259},
  url = {https://arxiv.org/abs/2211.15259},
  author = {Jaeger, Paul F. and Lüth, Carsten T. and Klein, Lukas and Bungert, Till J.},
  keywords = {Computer Vision and Pattern Recognition (cs.CV), Machine Learning (cs.LG), FOS: Computer and information sciences},
  title = {A Call to Reflect on Evaluation Practices for Failure Detection in Image Classification},
  publisher = {arXiv},
  year = {2022},
  copyright = {arXiv.org perpetual, non-exclusive license}
}

Installation

FD-Shifts requires Python version 3.10 or later. It is recommended to install FD-Shifts in its own environment (venv, conda environment, ...).

  1. Install an appropriate version of PyTorch. Check that CUDA is available and that the CUDA toolkit version is compatible with your hardware. The currently required PyTorch version is 1.11.0. Testing and development were done with the PyTorch build for CUDA 11.3.

  2. Install FD-Shifts. This pulls in all dependencies, including some version of PyTorch, so it is strongly recommended that you install a compatible version of PyTorch beforehand. This also makes the fd_shifts CLI available to you.

    pip install git+https://github.com/iml-dkfz/fd-shifts.git
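
The requirements from the steps above (Python 3.10 or later, PyTorch 1.11.0, CUDA visible to PyTorch) can be checked with a short script before launching anything. This is a minimal sketch and not part of FD-Shifts: the function name `check_environment` and its parameters are hypothetical, and it only uses standard PyTorch attributes (`torch.__version__`, `torch.cuda.is_available()`).

```python
import importlib.util
import sys


def check_environment(min_python=(3, 10), torch_version_prefix="1.11"):
    """Return a list of human-readable problems; an empty list means all checks passed.

    Hypothetical helper, not part of FD-Shifts.
    """
    problems = []

    # FD-Shifts states that it requires Python 3.10 or later.
    if sys.version_info < min_python:
        found = sys.version.split()[0]
        problems.append(
            f"Python {min_python[0]}.{min_python[1]}+ required, found {found}"
        )

    # Only import torch if it is installed, so the check itself never crashes.
    if importlib.util.find_spec("torch") is None:
        problems.append("PyTorch is not installed")
        return problems

    import torch

    if not torch.__version__.startswith(torch_version_prefix):
        problems.append(
            f"PyTorch {torch_version_prefix}.x expected, found {torch.__version__}"
        )
    if not torch.cuda.is_available():
        problems.append("CUDA is not available to PyTorch")
    return problems


if __name__ == "__main__":
    for problem in check_environment():
        print("WARNING:", problem)
```

Running the script prints one warning per failed check and nothing if the environment matches.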

How to Integrate Your Own Usecase

To learn about extending FD-Shifts with your own models, datasets, and confidence scoring functions, check out the tutorial on extending FD-Shifts.

Reproducing our results

To use fd_shifts, you need to set the following environment variables:

export EXPERIMENT_ROOT_DIR=/absolute/path/to/your/experiments
export DATASET_ROOT_DIR=/absolute/path/to/datasets

Alternatively, you may write them to a file and source that before running fd_shifts, e.g.

mv example.env .env

Then edit .env to your needs and run

source .env
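
If you drive fd_shifts from your own scripts, it can help to fail early when the two variables are missing. The following is a minimal sketch, not FD-Shifts code: `resolve_root_dirs` is a hypothetical helper, and the absolute-path check is an assumption based on the example paths above.

```python
import os
from pathlib import Path

# The two variables named in this README.
REQUIRED_VARS = ("EXPERIMENT_ROOT_DIR", "DATASET_ROOT_DIR")


def resolve_root_dirs(environ=None):
    """Return {variable name: Path} or raise if a variable is missing or relative.

    Hypothetical helper; `environ` defaults to the process environment.
    """
    environ = os.environ if environ is None else environ
    roots = {}
    for name in REQUIRED_VARS:
        value = environ.get(name)
        if not value:
            raise KeyError(f"{name} is not set")
        path = Path(value)
        # Assumption: paths should be absolute, as in the example above.
        if not path.is_absolute():
            raise ValueError(f"{name} must be an absolute path, got {value!r}")
        roots[name] = path
    return roots


if __name__ == "__main__":
    for name, path in resolve_root_dirs().items():
        print(f"{name} = {path}")
```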

Data Folder Requirements

For the predefined experiments we expect the data to be in the following folder structure relative to the folder you set for $DATASET_ROOT_DIR.

<$DATASET_ROOT_DIR>
├── breeds
│   └── ILSVRC ⇒ ../imagenet/ILSVRC
├── imagenet
│   └── ILSVRC
├── cifar10
├── cifar100
├── corrupt_cifar10
├── corrupt_cifar100
├── svhn
├── tinyimagenet
├── tinyimagenet_resize
├── wilds_animals
│   └── iwildcam_v2.0
└── wilds_camelyon
    └── camelyon17_v1.0
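
The layout above can be verified before starting a long run. This is a minimal sketch under the assumption that only the presence of the listed directories matters. `EXPECTED_DIRS` simply transcribes the tree above, and `missing_dataset_dirs` is a hypothetical helper, not part of FD-Shifts.

```python
from pathlib import Path

# Relative directories transcribed from the folder structure in this README.
EXPECTED_DIRS = (
    "breeds/ILSVRC",  # symlink to ../imagenet/ILSVRC in the tree above
    "imagenet/ILSVRC",
    "cifar10",
    "cifar100",
    "corrupt_cifar10",
    "corrupt_cifar100",
    "svhn",
    "tinyimagenet",
    "tinyimagenet_resize",
    "wilds_animals/iwildcam_v2.0",
    "wilds_camelyon/camelyon17_v1.0",
)


def missing_dataset_dirs(dataset_root):
    """Return the expected sub-directories that do not exist under dataset_root."""
    root = Path(dataset_root)
    # Path.is_dir() follows symlinks, so a valid breeds/ILSVRC link counts as present.
    return [rel for rel in EXPECTED_DIRS if not (root / rel).is_dir()]


if __name__ == "__main__":
    import os

    for rel in missing_dataset_dirs(os.environ["DATASET_ROOT_DIR"]):
        print("missing:", rel)
```

You only need the datasets for the experiments you actually launch, so a non-empty result is not necessarily an error.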

Training

To get a list of all fully qualified names for all experiments in the paper, use

fd_shifts list

You can reproduce the results of the paper either all at once:

fd_shifts launch

Some at a time:

fd_shifts launch --model=devries --dataset=cifar10

Or one at a time (use fd_shifts list to find the names of experiments):

fd_shifts launch --name=fd-shifts/svhn_paper_sweep/devries_bbsvhn_small_conv_do1_run1_rew2.2

Check out fd_shifts launch --help for more filtering options.

Inference

To run inference for one of the experiments, append --mode=test to any of the commands above.

Analysis

To run analysis for some of the predefined experiments, set --mode=analysis in any of the commands above.

To run analysis over an already available set of model outputs, the outputs have to be in the following format.

For a classifier with d outputs, N samples in total (over all tested datasets), and M dropout samples:

raw_logits.npz
Nx(d+2)

  0, 1, ...                 d-1,   d,      d+1
┌───────────────────────────────┬───────┬─────────────┐
|           logits_1            | label | dataset_idx |
├───────────────────────────────┼───────┼─────────────┤
|           logits_2            | label | dataset_idx |
├───────────────────────────────┼───────┼─────────────┤
|           logits_3            | label | dataset_idx |
└───────────────────────────────┴───────┴─────────────┘
.
.
.
┌───────────────────────────────┬───────┬─────────────┐
|           logits_N            | label | dataset_idx |
└───────────────────────────────┴───────┴─────────────┘

external_confids.npz
Nx1

raw_logits_dist.npz
NxdxM

  0, 1, ...                  d-1
┌───────────────────────────────┐
|   logits_1 (Dropout Sample 1) |
|   logits_1 (Dropout Sample 2) |
|               .               |
|               .               |
|               .               |
|   logits_1 (Dropout Sample M) |
├───────────────────────────────┤
|   logits_2 (Dropout Sample 1) |
|   logits_2 (Dropout Sample 2) |
|               .               |
|               .               |
|               .               |
|   logits_2 (Dropout Sample M) |
├───────────────────────────────┤
|   logits_3 (Dropout Sample 1) |
|   logits_3 (Dropout Sample 2) |
|               .               |
|               .               |
|               .               |
|   logits_3 (Dropout Sample M) |
└───────────────────────────────┘
                .
                .
                .
┌───────────────────────────────┐
|   logits_N (Dropout Sample 1) |
|   logits_N (Dropout Sample 2) |
|               .               |
|               .               |
|               .               |
|   logits_N (Dropout Sample M) |
└───────────────────────────────┘

external_confids_dist.npz
NxM
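
The four files described above can be assembled with NumPy. The sketch below packs logits, labels, and dataset indices into the Nx(d+2) layout and writes each array to its own .npz file. The helpers `pack_raw_logits`, `unpack_raw_logits`, and `save_outputs` are hypothetical and not part of FD-Shifts. Storing each array under NumPy's default key `arr_0` is an assumption about what the analysis code reads, so check it against the loader in fd_shifts before relying on it.

```python
import tempfile
from pathlib import Path

import numpy as np


def pack_raw_logits(logits, labels, dataset_idx):
    """Stack logits (N, d), labels (N,) and dataset_idx (N,) into an (N, d+2) array."""
    logits = np.asarray(logits, dtype=np.float64)
    labels = np.asarray(labels, dtype=np.float64).reshape(-1, 1)
    dataset_idx = np.asarray(dataset_idx, dtype=np.float64).reshape(-1, 1)
    if not (len(logits) == len(labels) == len(dataset_idx)):
        raise ValueError("logits, labels and dataset_idx must all have N rows")
    # Columns 0..d-1: logits, column d: label, column d+1: dataset_idx.
    return np.concatenate([logits, labels, dataset_idx], axis=1)


def unpack_raw_logits(raw):
    """Split an (N, d+2) array back into logits, integer labels and dataset_idx."""
    return raw[:, :-2], raw[:, -2].astype(int), raw[:, -1].astype(int)


def save_outputs(out_dir, **arrays):
    """Write each array to <out_dir>/<name>.npz.

    Assumption: a single positional array per file, which NumPy stores as "arr_0".
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    for name, array in arrays.items():
        np.savez_compressed(out_dir / f"{name}.npz", array)


if __name__ == "__main__":
    N, d, M = 6, 3, 4  # toy sizes for illustration only
    rng = np.random.default_rng(0)
    raw_logits = pack_raw_logits(
        rng.normal(size=(N, d)),           # logits
        rng.integers(0, d, size=N),        # labels
        np.array([0, 0, 0, 1, 1, 1]),      # dataset_idx
    )
    with tempfile.TemporaryDirectory() as tmp:
        save_outputs(
            tmp,
            raw_logits=raw_logits,                          # N x (d+2)
            external_confids=rng.random(size=(N, 1)),       # N x 1
            raw_logits_dist=rng.normal(size=(N, d, M)),     # N x d x M
            external_confids_dist=rng.random(size=(N, M)),  # N x M
        )
        print(sorted(p.name for p in Path(tmp).iterdir()))
```

The random values are placeholders. In practice the arrays come from your own model's forward passes.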

You may also use the ExperimentData class to load your data in another way. You also have to provide an adequate config, where all test datasets and query parameters are set. Check out the config files in fd_shifts/configs including the dataclasses. Importantly, the dataset_idx has to match up with the list of datasets you provide and whether or not val_tuning is set. If val_tuning is set, the validation set takes over dataset_idx=0.
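
The relation between dataset_idx and the dataset list can be illustrated with a small mapping. This sketch assumes that test datasets are numbered consecutively in the order of the configured list, and that setting val_tuning shifts them by one so that the validation set occupies index 0. The helper `dataset_index_map`, the placeholder name "validation", and the dataset names in the usage example are hypothetical; consult the config dataclasses in fd_shifts/configs for the authoritative behaviour.

```python
def dataset_index_map(test_datasets, val_tuning):
    """Map dataset_idx values to dataset names.

    Assumption: indices follow list order, and the validation set takes
    index 0 when val_tuning is set, pushing the test datasets up by one.
    """
    names = (["validation"] if val_tuning else []) + list(test_datasets)
    return dict(enumerate(names))


if __name__ == "__main__":
    # Hypothetical test dataset list.
    print(dataset_index_map(["cifar10", "svhn"], val_tuning=True))
    print(dataset_index_map(["cifar10", "svhn"], val_tuning=False))
```

With such a mapping, the rows that belong to one dataset are those whose last column in raw_logits.npz equals the corresponding index.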
