
Code release for the paper "Guiding Deep Learning System Testing using Surprise Adequacy"

[Update April 2021] Check out a recent paper with a fast, efficient implementation of SA: https://github.com/testingautomated-usi/surprise-adequacy. Big thanks to the authors! 😃

Guiding Deep Learning System Testing using Surprise Adequacy

If you find this paper helpful, please consider citing it:

@inproceedings{Kim2019aa,
	Author = {Jinhan Kim and Robert Feldt and Shin Yoo},
	Booktitle = {Proceedings of the 41st International Conference on Software Engineering},
	Pages = {1039-1049},
	Publisher = {IEEE Press},
	Series = {ICSE 2019},
	Title = {Guiding Deep Learning System Testing using Surprise Adequacy},
	Year = {2019}
}

Introduction

This archive includes code for computing Surprise Adequacy (SA) and Surprise Coverage (SC), which are the basic components of the main experiments in the paper. Currently, the run.py script contains a simple example that calculates SA and SC for the MNIST test set and for an adversarial set generated with the FGSM method, considering only the last hidden layer (activation_3). To select different layers, modify layer_names in run.py.
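As a rough illustration of what computing SA from activation traces involves, the sketch below computes Distance-based SA (DSA) with NumPy, following the definition in the paper: for an input with predicted class c, dist_a is the distance from its activation trace to the nearest training trace of class c, dist_b is the distance from that nearest trace to the closest training trace of any other class, and DSA is dist_a / dist_b. The function name and arguments are hypothetical and are not taken from sa.py.

```python
import numpy as np


def dsa_sketch(train_ats, train_labels, test_ats, test_preds):
    """Distance-based Surprise Adequacy for each activation trace under test.

    train_ats:    (n_train, n_neurons) activation traces of training inputs
    train_labels: (n_train,) class of each training input
    test_ats:     (n_test, n_neurons) activation traces of inputs under test
    test_preds:   (n_test,) class predicted for each input under test
    """
    scores = []
    for at, pred in zip(test_ats, test_preds):
        same = train_ats[train_labels == pred]
        other = train_ats[train_labels != pred]
        # dist_a: distance to the nearest training trace of the same class.
        d_same = np.linalg.norm(same - at, axis=1)
        nearest = same[np.argmin(d_same)]
        dist_a = d_same.min()
        # dist_b: distance from that reference trace to the nearest
        # training trace that belongs to any other class.
        dist_b = np.linalg.norm(other - nearest, axis=1).min()
        scores.append(dist_a / dist_b)
    return np.array(scores)
```

The loop is written for readability rather than speed; it assumes every predicted class has at least one training trace and that at least two classes are present.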

Files and Directories

  • run.py - Script that computes SA for a benign dataset and adversarial examples (MNIST and CIFAR-10).
  • sa.py - Tools that fetch activation traces and compute LSA, DSA, and coverage.
  • train_model.py - Model training script for MNIST and CIFAR-10. It keeps the trained models in the "model" directory (code from Ma et al.).
  • model directory - Used for saving models.
  • tmp directory - Used for saving activation traces and prediction arrays.
  • adv directory - Used for saving adversarial examples.

Command-line Options of run.py

  • -d - The subject dataset (either mnist or cifar). Default is mnist.
  • -lsa - If set, computes LSA.
  • -dsa - If set, computes DSA.
  • -target - The name of the target input set. Default is fgsm.
  • -save_path - The temporary save path for activation trace (AT) files. Default is the tmp directory.
  • -batch_size - Batch size. Default is 128.
  • -var_threshold - Variance threshold. Default is 1e-5.
  • -upper_bound - Upper bound of SA. Default is 2000.
  • -n_bucket - The number of buckets for coverage. Default is 1000.
  • -num_classes - The number of classes in the dataset. Default is 10.
  • -is_classification - Set if the task is a classification problem. Default is True.
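To make the roles of -upper_bound and -n_bucket concrete, here is a minimal sketch of how Surprise Coverage can be computed: the range from 0 to the upper bound is split into equal-width buckets, and coverage is the percentage of buckets hit by at least one SA value. The function name, the lower_bound argument, and the handling of values outside the range (they are ignored) are assumptions made for this example, not the code in sa.py.

```python
import numpy as np


def surprise_coverage(sa_values, upper_bound=2000.0, n_bucket=1000, lower_bound=0.0):
    """Percentage of equal-width SA buckets hit by at least one input."""
    sa_values = np.asarray(sa_values, dtype=float)
    # Assumption: values outside [lower_bound, upper_bound) cover nothing.
    in_range = sa_values[(sa_values >= lower_bound) & (sa_values < upper_bound)]
    width = (upper_bound - lower_bound) / n_bucket
    bucket_ids = np.floor((in_range - lower_bound) / width).astype(int)
    # Guard against floating-point rounding at the upper edge.
    bucket_ids = np.minimum(bucket_ids, n_bucket - 1)
    return 100.0 * len(np.unique(bucket_ids)) / n_bucket
```

Because the bucket width depends on the upper bound, the same SA values yield a different coverage figure when -upper_bound changes.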

Generating Adversarial Examples

We used the framework by Ma et al. to generate various adversarial examples (FGSM, BIM-A, BIM-B, JSMA, and C&W). Please refer to craft_adv_samples.py in the repository of Ma et al., and put the generated examples in the adv directory. For a basic usage example, an adversarial set generated by the FGSM method for MNIST is included (see ./adv/adv_mnist_fgsm.npy).
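The included file is a NumPy array file, so an adversarial set can be inspected with np.load. The helper below is a hypothetical convenience function; the adv_{dataset}_{attack}.npy naming pattern is inferred from the single file name mentioned above and may not hold for sets you generate yourself.

```python
import os

import numpy as np


def load_adv_set(dataset="mnist", attack="fgsm", adv_dir="./adv"):
    """Load an adversarial set stored as a .npy array.

    The file name pattern is an assumption based on adv_mnist_fgsm.npy.
    """
    path = os.path.join(adv_dir, "adv_{}_{}.npy".format(dataset, attack))
    if not os.path.exists(path):
        raise FileNotFoundError("No adversarial set found at {}".format(path))
    return np.load(path)
```

For example, load_adv_set("mnist", "fgsm") reads ./adv/adv_mnist_fgsm.npy when run from the repository root.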

Udacity Self-driving Car Challenge

To reproduce the results for the Udacity self-driving car challenge, please refer to the DeepXplore and DeepTest repositories, which contain information about the dataset, the models (Dave-2, Chauffeur), and the synthetic data generation processes. Downloading the dataset and the models may take a few hours due to their size.

How to Use

Our implementation is based on Python 3.5.2, TensorFlow 1.9.0, Keras 2.2, and NumPy 1.14.5. Details are listed in requirements.txt.

The following is a simple example of installing the dependencies and computing LSA or DSA for the MNIST test set and its FGSM adversarial set.

# install Python dependencies
pip install -r requirements.txt

# train a model
python train_model.py -d mnist

# calculate LSA, coverage, and ROC-AUC score
python run.py -lsa

# calculate DSA, coverage, and ROC-AUC score
python run.py -dsa
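The ROC-AUC score indicates how well SA separates the target (adversarial) inputs from the original test inputs. The sketch below is a simplified illustration that uses the raw SA value directly as the score and computes the AUC as the fraction of (adversarial, benign) pairs ranked correctly; run.py may compute the score differently, for example by fitting a classifier on the SA values first. The function name is hypothetical.

```python
import numpy as np


def roc_auc_from_sa(benign_sa, adversarial_sa):
    """ROC-AUC when the raw SA value is used as the 'is adversarial' score."""
    benign = np.asarray(benign_sa, dtype=float)
    adv = np.asarray(adversarial_sa, dtype=float)
    # Compare every adversarial score with every benign score.
    # This needs O(len(adv) * len(benign)) memory, so it suits small sets only.
    diff = adv[:, None] - benign[None, :]
    wins = np.sum(diff > 0)
    ties = np.sum(diff == 0)
    return (wins + 0.5 * ties) / (adv.size * benign.size)
```

A value near 1.0 means adversarial inputs almost always receive higher SA than benign ones, while 0.5 means SA does not separate the two sets.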

Notes

  • If you encounter the error "ValueError: Input contains NaN, infinity or a value too large for dtype('float64').", increase the variance threshold. Please refer to the configuration details in the paper (Section IV-C).
  • Images were preprocessed by clipping their pixel values to the range -0.5 to 0.5.
  • If you want to select specific layers, modify the layer_names list in run.py.
  • Coverage may vary depending on the upper bound.
  • For faster execution, use the GPU build of TensorFlow.
  • All experimental results
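The variance threshold mentioned in the first note can be pictured as a filter on neurons: neurons whose activation values barely change across the training set are dropped before SA is computed. Near-constant neurons are presumably what makes the density estimate behind LSA degenerate and produces the NaN error, which is why raising the threshold helps. The sketch below assumes activation traces are stored as 2-D arrays with one column per neuron; the function name is hypothetical.

```python
import numpy as np


def drop_low_variance_neurons(train_ats, target_ats, var_threshold=1e-5):
    """Keep only neurons whose training-set activation variance reaches the threshold.

    train_ats:  (n_train, n_neurons) activation traces of training inputs
    target_ats: (n_target, n_neurons) activation traces of the inputs to score
    """
    variances = np.var(train_ats, axis=0)
    keep = variances >= var_threshold
    # Apply the same neuron mask to both sets so the columns stay aligned.
    return train_ats[:, keep], target_ats[:, keep], keep
```

If the mask removes every neuron, the threshold is too high for the selected layer.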

References