This repository was created on the basis of https://github.com/IML-DKFZ/fd-shifts.
Reliable application of machine learning-based decision systems in the wild is one of the major challenges currently investigated by the field. A large portion of established approaches aims to detect erroneous predictions by means of assigning confidence scores. This confidence may be obtained by either quantifying the model's predictive uncertainty, learning explicit scoring functions, or assessing whether the input is in line with the training distribution. Curiously, while these approaches all state to address the same eventual goal of detecting failures of a classifier upon real-life application, they currently constitute largely separated research fields with individual evaluation protocols, which either exclude a substantial part of relevant methods or ignore large parts of relevant failure sources. In this work, we systematically reveal current pitfalls caused by these inconsistencies and derive requirements for a holistic and realistic evaluation of failure detection. To demonstrate the relevance of this unified perspective, we present a large-scale empirical study for the first time enabling benchmarking confidence scoring functions w.r.t all relevant methods and failure sources. The revelation of a simple softmax response baseline as the overall best performing method underlines the drastic shortcomings of current evaluation in the abundance of publicized research on confidence scoring.
Holistic perspective on failure detection. Detecting failures should be seen in the context of the overarching goal of preventing silent failures of a classifier, which includes two tasks: preventing failures in the first place as measured by the "robustness" of a classifier (Task 1), and detecting the non-prevented failures by means of CSFs (Task 2, focus of this work). For failure prevention across distribution shifts, a consistent task formulation exists (featuring accuracy as the primary evaluation metric) and various benchmarks have been released covering a large variety of realistic shifts (e.g. image corruption shifts, sub-class shifts, or domain shifts). In contrast, progress in the subsequent task of detecting the non-prevented failures by means of CSFs is currently obstructed by three pitfalls: 1) A diverse and inconsistent set of evaluation protocols for CSFs exists (MisD, SC, PUQ, OoD-D) impeding comprehensive competition. 2) Only a fraction of the spectrum of realistic distribution shifts and thus potential failure sources is covered diminishing the practical relevance of evaluation. 3) The task formulation in OoD-D fundamentally deviates from the stated purpose of detecting classification failures. Overall, the holistic perspective on failure detection reveals an obvious need for a unified and comprehensive evaluation protocol, in analogy to current robustness benchmarks, to make classifiers fit for safety-critical applications. Abbreviations: CSF: Confidence Scoring Function, OoD-D: Out-of-Distribution Detection, MisD: Misclassification Detection, PUQ: Predictive Uncertainty Quantification, SC: Selective Classification
If you use FD-Shifts, please cite our paper:
@inproceedings{Jaeger2022ACT,
doi = {10.48550/ARXIV.2211.15259},
url = {https://arxiv.org/abs/2211.15259},
author = {Jaeger, Paul F. and Lüth, Carsten T. and Klein, Lukas and Bungert, Till J.},
keywords = {Computer Vision and Pattern Recognition (cs.CV), Machine Learning (cs.LG), FOS: Computer and information sciences},
title = {A Call to Reflect on Evaluation Practices for Failure Detection in Image Classification},
publisher = {arXiv},
year = {2022},
copyright = {arXiv.org perpetual, non-exclusive license}
}
FD-Shifts requires Python version 3.10 or later. It is recommended to install FD-Shifts in its own environment (venv, conda environment, ...).
- Install an appropriate version of PyTorch. Check that CUDA is available and that the CUDA toolkit version is compatible with your hardware. The currently required version of PyTorch is v1.11.0. Testing and development were done with the PyTorch build using CUDA 11.3.
- Install FD-Shifts. This will pull in all dependencies, including some version of PyTorch, so it is strongly recommended that you install a compatible version of PyTorch beforehand. This will also make the fd_shifts CLI available to you.
  pip install git+https://github.com/iml-dkfz/fd-shifts.git
To learn about extending FD-Shifts with your own models, datasets, and confidence scoring functions, check out the tutorial on extending FD-Shifts.
To use fd_shifts, you need to set the following environment variables:
export EXPERIMENT_ROOT_DIR=/absolute/path/to/your/experiments
export DATASET_ROOT_DIR=/absolute/path/to/datasets
Alternatively, you may write them to a file and source that file before running fd_shifts, e.g.
mv example.env .env
Then edit .env to your needs and run
source .env
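Once the variables are exported (or sourced from .env), a quick sanity check can catch a missing or relative path before a long run starts. The following is a minimal sketch, not part of FD-Shifts: the helper name check_fd_shifts_env is hypothetical, and the requirement that both paths be absolute is an assumption read off the example values above.

```python
import os
from pathlib import Path

# The two variables named in this README.
REQUIRED_VARS = ("EXPERIMENT_ROOT_DIR", "DATASET_ROOT_DIR")


def check_fd_shifts_env(environ=None):
    """Return the required paths, raising if one is unset or not absolute.

    Hypothetical helper; the absolute-path rule is an assumption.
    """
    environ = os.environ if environ is None else environ
    paths = {}
    for name in REQUIRED_VARS:
        value = environ.get(name)
        if not value:
            raise RuntimeError(f"{name} is not set")
        path = Path(value)
        if not path.is_absolute():
            raise RuntimeError(f"{name} must be an absolute path, got {value!r}")
        paths[name] = path
    return paths
```

Calling check_fd_shifts_env() without arguments inspects the current process environment; passing a dict makes the check easy to test.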
For the predefined experiments, we expect the data to be in the following folder structure relative to the folder you set for $DATASET_ROOT_DIR.
<$DATASET_ROOT_DIR>
├── breeds
│   └── ILSVRC ⇒ ../imagenet/ILSVRC
├── imagenet
│   └── ILSVRC
├── cifar10
├── cifar100
├── corrupt_cifar10
├── corrupt_cifar100
├── svhn
├── tinyimagenet
├── tinyimagenet_resize
├── wilds_animals
│   └── iwildcam_v2.0
└── wilds_camelyon
    └── camelyon17_v1.0
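The layout above can be verified before launching experiments. This is a minimal sketch under the assumption that every entry in the tree is a directory (or a symlink to one, as with breeds/ILSVRC); the function name missing_dataset_dirs is hypothetical and not part of FD-Shifts.

```python
from pathlib import Path

# Relative paths copied from the folder structure in this README.
EXPECTED_DIRS = [
    "breeds/ILSVRC",
    "imagenet/ILSVRC",
    "cifar10",
    "cifar100",
    "corrupt_cifar10",
    "corrupt_cifar100",
    "svhn",
    "tinyimagenet",
    "tinyimagenet_resize",
    "wilds_animals/iwildcam_v2.0",
    "wilds_camelyon/camelyon17_v1.0",
]


def missing_dataset_dirs(dataset_root):
    """Return the expected sub-directories that do not exist under dataset_root."""
    root = Path(dataset_root)
    # Path.is_dir() follows symlinks, so a link to ../imagenet/ILSVRC counts.
    return [rel for rel in EXPECTED_DIRS if not (root / rel).is_dir()]
```

An empty return value means every folder from the tree was found, for example when called as missing_dataset_dirs(os.environ["DATASET_ROOT_DIR"]).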
To get a list of all fully qualified names for all experiments in the paper, use
fd_shifts list
You can reproduce the results of the paper either all at once:
fd_shifts launch
Some at a time:
fd_shifts launch --model=devries --dataset=cifar10
Or one at a time (use fd_shifts list to find the names of experiments):
fd_shifts launch --name=fd-shifts/svhn_paper_sweep/devries_bbsvhn_small_conv_do1_run1_rew2.2
Check out fd_shifts launch --help for more filtering options.
To run inference for one of the experiments, append --mode=test to any of the commands above.
To run analysis for some of the predefined experiments, set --mode=analysis in any of the commands above.
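When scripting several runs, the command lines shown above can be assembled programmatically. The sketch below only composes the argument list from flags that appear in this README (--model, --dataset, --name, --mode); the helper launch_command is hypothetical and does not call into FD-Shifts itself.

```python
import shlex


def launch_command(mode=None, **filters):
    """Build an argv list for `fd_shifts launch` from keyword filters.

    Hypothetical helper; only flags documented in this README are expected.
    """
    argv = ["fd_shifts", "launch"]
    argv += [f"--{key}={value}" for key, value in filters.items()]
    if mode is not None:
        # Per the README, --mode=test runs inference, --mode=analysis runs analysis.
        argv.append(f"--mode={mode}")
    return argv


cmd = launch_command(mode="test", model="devries", dataset="cifar10")
print(shlex.join(cmd))
# prints: fd_shifts launch --model=devries --dataset=cifar10 --mode=test
```

The resulting list can be handed to subprocess.run(cmd, check=True) to execute it.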
To run analysis over an already available set of model outputs, the outputs have to be in the following format. For a classifier with d outputs, N samples in total (over all tested datasets), and M dropout samples:
raw_logits.npz
Nx(d+2)
0, 1, ... d─1, d, d+1
┌───────────────────────────────┬───────┬─────────────┐
| logits_1 | label | dataset_idx |
├───────────────────────────────┼───────┼─────────────┤
| logits_2 | label | dataset_idx |
├───────────────────────────────┼───────┼─────────────┤
| logits_3 | label | dataset_idx |
└───────────────────────────────┴───────┴─────────────┘
.
.
.
┌───────────────────────────────┬───────┬─────────────┐
| logits_N | label | dataset_idx |
└───────────────────────────────┴───────┴─────────────┘
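The Nx(d+2) layout can be assembled from separate arrays with NumPy. This is a minimal sketch with random placeholder data; the function name pack_raw_logits is hypothetical, and the key under which FD-Shifts expects the array inside the .npz file is not stated here, so the commented save call is an assumption.

```python
import numpy as np


def pack_raw_logits(logits, labels, dataset_idx):
    """Concatenate logits (N, d), labels (N,), and dataset_idx (N,) into (N, d+2)."""
    logits = np.asarray(logits, dtype=np.float64)
    labels = np.asarray(labels, dtype=np.float64).reshape(-1, 1)
    dataset_idx = np.asarray(dataset_idx, dtype=np.float64).reshape(-1, 1)
    if not (len(logits) == len(labels) == len(dataset_idx)):
        raise ValueError("logits, labels and dataset_idx must have the same length N")
    # Columns 0..d-1: logits, column d: label, column d+1: dataset_idx.
    return np.concatenate([logits, labels, dataset_idx], axis=1)


rng = np.random.default_rng(0)
N, d = 5, 3  # placeholder sizes
raw = pack_raw_logits(
    rng.normal(size=(N, d)),
    rng.integers(0, d, size=N),
    np.zeros(N),
)
# Assumption: a single positional array, which NumPy stores under the key "arr_0".
# np.savez_compressed("raw_logits.npz", raw)
```

Labels and dataset indices are stored as floats here only because all columns share one array dtype.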
external_confids.npz
Nx1
raw_logits_dist.npz
NxdxM
0, 1, ... d─1
┌───────────────────────────────┐
| logits_1 (Dropout Sample 1) |
| logits_1 (Dropout Sample 2) |
| . |
| . |
| . |
| logits_1 (Dropout Sample M) |
├───────────────────────────────┤
| logits_2 (Dropout Sample 1) |
| logits_2 (Dropout Sample 2) |
| . |
| . |
| . |
| logits_2 (Dropout Sample M) |
├───────────────────────────────┤
| logits_3 (Dropout Sample 1) |
| logits_3 (Dropout Sample 2) |
| . |
| . |
| . |
| logits_3 (Dropout Sample M) |
└───────────────────────────────┘
.
.
.
┌───────────────────────────────┐
| logits_N (Dropout Sample 1) |
| logits_N (Dropout Sample 2) |
| . |
| . |
| . |
| logits_N (Dropout Sample M) |
└───────────────────────────────┘
external_confids_dist.npz
NxM
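The two dropout arrays can be built by stacking the M stochastic forward passes along a new last axis. The sketch below uses random placeholder data and a hypothetical helper, stack_dropout_samples; it illustrates the NxdxM and NxM shapes described above and is not taken from the FD-Shifts code.

```python
import numpy as np


def stack_dropout_samples(per_pass):
    """Stack M per-pass arrays along a new last axis.

    M arrays of shape (N, d) become (N, d, M); M arrays of shape (N,) become (N, M).
    """
    return np.stack([np.asarray(p) for p in per_pass], axis=-1)


N, d, M = 4, 3, 10  # placeholder sizes
rng = np.random.default_rng(0)
logit_passes = [rng.normal(size=(N, d)) for _ in range(M)]
confid_passes = [rng.uniform(size=N) for _ in range(M)]

raw_logits_dist = stack_dropout_samples(logit_passes)         # shape (N, d, M)
external_confids_dist = stack_dropout_samples(confid_passes)  # shape (N, M)
```

With this layout, raw_logits_dist[i, :, m] holds the logits of sample i under dropout sample m.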
You may also use the ExperimentData class to load your data in another way.
You also have to provide an adequate config in which all test datasets and query parameters are set. Check out the config files in fd_shifts/configs, including the dataclasses. Importantly, the dataset_idx has to match up with the list of datasets you provide and with whether or not val_tuning is set. If val_tuning is set, the validation set takes over dataset_idx=0.
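The dataset_idx convention can be illustrated with a small mapping function. This is a sketch under an assumption: that the test datasets keep their list order and are shifted by one when val_tuning is set. The function name dataset_index_map and the "validation" label are hypothetical; the authoritative behavior is defined by the configs in fd_shifts/configs.

```python
def dataset_index_map(test_datasets, val_tuning):
    """Map dataset names to dataset_idx values.

    Assumption: with val_tuning the validation set occupies index 0 and the
    test datasets follow in the order given; otherwise they start at 0.
    """
    names = (["validation"] if val_tuning else []) + list(test_datasets)
    return {name: idx for idx, name in enumerate(names)}


print(dataset_index_map(["cifar10", "svhn"], val_tuning=True))
# prints: {'validation': 0, 'cifar10': 1, 'svhn': 2}
```

Comparing such a mapping against the last column of raw_logits.npz is one way to spot an off-by-one between the config and the stored outputs.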