Robustness Metrics provides lightweight modules for evaluating the robustness of classification models across three sets of metrics:
- out-of-distribution generalization (e.g. a non-expert human would still be able to classify similar objects under a changed viewpoint, scene setting, or clutter).
- stability (of the prediction and predicted probabilities) under natural perturbation of the input.
- uncertainty (e.g. assessing the extent to which the probabilities predicted by a model reflect the true probabilities).
The library includes popular out-of-distribution datasets (ImageNetV2, ImageNet-C, etc.) and can be readily applied to benchmark arbitrary models. It is not limited to vision models: any mapping from input to logits will do.
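To make the uncertainty item concrete, here is a minimal sketch of a binned expected calibration error (ECE), one common way to measure how well predicted probabilities match observed accuracy. The function name, the equal-width binning, and the default of 10 bins are assumptions made for illustration; the library's own implementation may differ.

```python
import numpy as np


def expected_calibration_error(probs, labels, num_bins=10):
    """Binned ECE sketch: weighted mean of |accuracy - confidence| per bin."""
    probs = np.asarray(probs, dtype=np.float64)
    labels = np.asarray(labels)

    confidences = probs.max(axis=1)        # top predicted probability
    predictions = probs.argmax(axis=1)     # predicted class
    correct = (predictions == labels).astype(np.float64)

    # Equal-width bins over [0, 1]; bin b holds confidences in (b/n, (b+1)/n].
    edges = np.linspace(0.0, 1.0, num_bins + 1)
    bin_ids = np.digitize(confidences, edges[1:-1], right=True)

    ece = 0.0
    for b in range(num_bins):
        mask = bin_ids == b
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap       # weight by fraction of samples
    return ece
```

A model that is always fully confident but right only half of the time gets an ECE of 0.5 under this definition, while a model that is fully confident and always right gets 0.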
First, install the library and its dependencies as
python setup.py install
or directly from the repository as
pip install "git+https://github.com/google-research/robustness_metrics.git#egg=robustness_metrics"
There are three steps to evaluate a model: 1. import the model; 2. launch an experiment; and 3. examine results.
Import your model by writing a file that specifies how to make predictions and
how we should pre-process the data. This file contains a single function create
that returns a tuple of:
- a function predict(features: Dict[str, Tensor]) that takes a batch from the dataset and computes your model predictions; and
- a pre-processing function that will be applied to the dataset.
The latter can be omitted (return None), in which case we will use default
pre-processing. For ImageNet, this takes a central crop of size (224, 224) and
scales the pixel values to the range [-1, +1]. The dictionary holding the
inputs, features, follows the same naming convention as tensorflow_datasets. As
all imported datasets are currently image datasets, this means that a batch of
images will be stored in the field features["image"].
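As a rough illustration of what a central crop to (224, 224) followed by scaling to [-1, +1] amounts to, consider the NumPy sketch below. It assumes a single image of shape [height, width, channels] with pixel values in [0, 255] and at least 224 pixels per side; the function name is hypothetical, and the library's actual default pre-processing may include further steps (such as resizing) that are not shown here.

```python
import numpy as np


def central_crop_and_scale(image, size=224):
    """Crops the center size x size window and maps [0, 255] to [-1, +1]."""
    image = np.asarray(image)
    height, width = image.shape[:2]

    # Offsets that center the crop window inside the image.
    top = (height - size) // 2
    left = (width - size) // 2
    crop = image[top:top + size, left:left + size]

    # 0 maps to -1.0 and 255 maps to +1.0.
    return crop.astype(np.float32) / 127.5 - 1.0
```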
For examples, see the models directory. You don't need to store your model
file there.
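A minimal model file following the description above might look like the sketch below. It assumes a 1000-class ImageNet-style label space and that returning a NumPy array of logits of shape [batch, num_classes] is acceptable; both are assumptions for illustration, and the files in the models directory remain the authoritative examples.

```python
import numpy as np

NUM_CLASSES = 1000  # Assumption: ImageNet-style label space.


def create():
    """Returns (predict, preprocess); None requests default pre-processing."""
    rng = np.random.default_rng(0)

    def predict(features):
        # A batch of images is stored under features["image"].
        images = np.asarray(features["image"])
        batch_size = images.shape[0]
        # Stand-in for a real model: random logits, one row per image.
        return rng.uniform(size=(batch_size, NUM_CLASSES))

    return predict, None
```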
Parameterized models. Sometimes you have multiple model variants that you
would like to test, e.g., different model sizes or training datasets. To
achieve this, add arguments to the create function, e.g.
create(network_type, network_width), to import networks of varying widths and
sizes.
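The sketch below shows one way a parameterized create function could be laid out. The toy network, the accepted values "linear" and "relu" for network_type, and the 1000-class output are all hypothetical and chosen only to keep the example self-contained.

```python
import numpy as np


def create(network_type="linear", network_width=64):
    """Hypothetical parameterized entry point for a model file."""
    rng = np.random.default_rng(0)
    num_classes = 1000  # Assumption: ImageNet-style label space.

    # Toy weights whose shapes depend on the requested width.
    hidden = rng.normal(size=(3, network_width))
    head = rng.normal(size=(network_width, num_classes))

    def predict(features):
        images = np.asarray(features["image"], dtype=np.float32)
        pooled = images.mean(axis=(1, 2))       # [batch, 3] channel means
        activations = pooled @ hidden           # [batch, network_width]
        if network_type == "relu":
            activations = np.maximum(activations, 0.0)
        return activations @ head               # [batch, num_classes]

    return predict, None
```

With a file like this, the variant would be selected at launch time through the model arguments, for instance network_type='relu',network_width=128.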
Non-TensorFlow models. If your model is not written in TensorFlow, you can
convert the data to numpy and feed it to your model. For examples, take a look
at models/random_imagenet_numpy.py, models/vit.py for a JAX model, and
models/vgg.py for a PyTorch model. Please do not forget to set the flag
--tf_on_cpu when running compute_report.py.
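The conversion step can be sketched as a small adapter. The helper names to_numpy and make_numpy_predict are hypothetical, and forward_fn stands in for whatever JAX or PyTorch forward pass you have; the only library behavior relied on is that TensorFlow eager tensors expose a .numpy() method.

```python
import numpy as np


def to_numpy(value):
    """Converts a tensor-like value to a NumPy array."""
    # TensorFlow eager tensors provide .numpy(); other inputs go via asarray.
    if hasattr(value, "numpy"):
        return value.numpy()
    return np.asarray(value)


def make_numpy_predict(forward_fn):
    """Wraps a framework-agnostic forward function as a predict function."""

    def predict(features):
        images = to_numpy(features["image"])
        # forward_fn is assumed to map [batch, H, W, C] images to logits.
        return np.asarray(forward_fn(images))

    return predict
```

A create function could then return make_numpy_predict(my_forward), None, where my_forward is your own model call.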
You can either run the launcher to compute a specific set of measurements (e.g.
accuracy on ImageNet, expected calibration error on ImageNet-A), which is done
via the --measurement flag, or you can compute all the measurements that are
necessary for a specific robustness report, which is done using the --report
flag.
Note that the library uses tensorflow_datasets to load the data. If you are
loading a dataset for the first time on your system, it will first be
downloaded and serialized to a local directory.
Launch bin/compute_report.py, passing in your model file via --model_path. If
your create function has parameters, you can pass them via the --model_args
flag (as Python code; it will be literal_eval'ed).
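To give a feel for what literal evaluation of such an argument string involves, the sketch below turns a string of keyword arguments into a dictionary using only Python literals. The helper parse_model_args is hypothetical and is not the launcher's actual code; it only illustrates why values must be plain literals such as strings and numbers.

```python
import ast


def parse_model_args(args):
    """Parses "a='x',b=1" into {"a": "x", "b": 1} using literal values only."""
    # Wrap the string in a dummy call so Python's parser splits the keywords.
    call = ast.parse(f"f({args})", mode="eval").body
    # literal_eval accepts AST nodes and rejects anything but plain literals.
    return {kw.arg: ast.literal_eval(kw.value) for kw in call.keywords}
```

Because only literals are accepted, an argument such as size=os.getcwd() would raise an error rather than execute code.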
You can explicitly specify the set of measurements you want to make
python3 bin/compute_report.py \
--model_path=models/random_imagenet_numpy.py \
--measurement="accuracy@imagenet" \
--measurement="nll@imagenet_v2(variant='MATCHED_FREQUENCY')" \
--measurement="ece@imagenet_a"
or, alternatively, you can use one of the reports we provide, e.g.
python3 bin/compute_report.py \
--model_path=models/random_imagenet_numpy.py \
--report="classification_report(datasets=['imagenet'])"
For the list of reports, please see reports/.
We provide several models in the directory models/ that you can run to
reproduce their results. The models are serialized as tensorflow_hub models
and will be automatically downloaded to your disk. For example:
python3 bin/compute_report.py \
--model_path=models/bit.py \
--model_args="dataset='Imagenet21k',network='R50',size='x1'" \
--measurement="accuracy@imagenet" \
--measurement="nll@imagenet_v2(variant='MATCHED_FREQUENCY')" \
--measurement="ece@imagenet_a"
If you are running non-TensorFlow models (for example, models/vit.py is a JAX
model, and models/vgg.py is in PyTorch), please set the flag --tf_on_cpu.
To see results, look at the printed output.