
Explanations can be manipulated and Geometry is to blame (unofficial extended code)

Explanation methods aim to make neural networks more trustworthy and interpretable. In this paper, we demonstrate a property of explanation methods which is disconcerting for both of these purposes. Namely, we show that explanations can be manipulated arbitrarily by applying visually hardly perceptible perturbations to the input that keep the network's output approximately constant. We establish theoretically that this phenomenon can be related to certain geometrical properties of neural networks. This allows us to derive an upper bound on the susceptibility of explanations to manipulations. Based on this result, we propose effective mechanisms to enhance the robustness of explanations.


Remarks: This repository extends the original repository with the following changes (see [Patch]):

  1. LRP-Gamma for VGG16, with the gamma values {0.5, 0.25, 0.1, 0} applied to the 1st and 2nd blocks, the 3rd block, the 4th block, and the 5th block together with the classification head, respectively (see the sketch after this list).
  2. The objective function additionally takes the preservation of total relevance into account.
  3. The attack script takes a CSV listing pairs of original and target images and iterates over them. It can be run by
    $ head n02097474.csv
    original,target
    image1.jpeg,image2.jpeg
    ...
    
    $ python run_attack.py --data_dir ~/datasets/imagenet \
         --cuda \
         --seed_file n02097474.csv
    
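A minimal sketch of how the per-block gamma assignment above could be expressed, assuming torchvision's vgg16().features layer indexing; GAMMA_PER_LAYER and gamma_for_layer are illustrative names, not the repository's API:

    # Gamma values per VGG16 conv layer (indices into torchvision's vgg16().features).
    GAMMA_PER_LAYER = {
        0: 0.5, 2: 0.5,                  # block 1 conv layers
        5: 0.5, 7: 0.5,                  # block 2 conv layers
        10: 0.25, 12: 0.25, 14: 0.25,    # block 3 conv layers
        17: 0.1, 19: 0.1, 21: 0.1,       # block 4 conv layers
        24: 0.0, 26: 0.0, 28: 0.0,       # block 5 conv layers
    }

    def gamma_for_layer(index):
        """Gamma used by the LRP-Gamma rule for a features-layer index;
        anything not listed (e.g. the classification head) falls back to 0."""
        return GAMMA_PER_LAYER.get(index, 0.0)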
Results from the extended code:

We also provide the results for all 50 validation images of the class Tibetan_terrier (n02097474, class index 200) at https://tubcloud.tu-berlin.de/s/TmZR8Yje3RcXRif.


What we do

We manipulate images so their explanation resembles an arbitrary target map. Below you can see our algorithm in action:

In our paper we show how to achieve such manipulations. We discuss their nature and derive an upper bound on how much the explanation can change. Based on this bound we propose β-smoothing, a method that can be applied to any of the considered explanation methods to increase robustness against manipulations.
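A minimal sketch of such a manipulation loop, assuming a PyTorch classifier and the plain gradient (saliency) explanation; manipulate, its arguments, and the loss weighting are illustrative and not the repository's exact implementation:

    import torch
    import torch.nn.functional as F

    def gradient_explanation(model, x):
        """Gradient of the top-class score w.r.t. the input, kept differentiable."""
        out = model(x)
        score = out.max(dim=1).values.sum()
        expl, = torch.autograd.grad(score, x, create_graph=True)
        return expl, out

    def manipulate(model, x, target_map, steps=500, lr=1e-3, weight=1e6):
        """Perturb x so its explanation approaches target_map while the output stays close."""
        out_orig = model(x).detach()
        x_adv = x.clone().detach().requires_grad_(True)
        optimizer = torch.optim.Adam([x_adv], lr=lr)
        for _ in range(steps):
            optimizer.zero_grad()
            expl, out = gradient_explanation(model, x_adv)
            # Match the target explanation while penalizing changes of the output.
            loss = F.mse_loss(expl, target_map) + weight * F.mse_loss(out, out_orig)
            loss.backward()
            optimizer.step()
        return x_adv.detach()

Note that the explanation term needs second derivatives of the activations, which is why the scripts below offer a softplus-based option for gradient explanations.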

β-smoothing

We have demonstrated that one can drastically change the explanation map while keeping the output of the neural network constant. We argue that this vulnerability is related to the large curvature of the output manifold of the neural network. Focusing on the gradient method: the fact that the gradient can be changed drastically by slightly perturbing the input along a hypersurface of constant network output suggests that the curvature of this hypersurface is large. If we replace the ReLU activations with softplus activations with parameter β and decrease β, we reduce the curvature of the lines of equal network output. Below you can see the smoothing in action for a two-layer neural network.
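A minimal β-smoothing sketch, assuming a standard torch.nn module tree; beta_smooth is an illustrative helper, not the repository's API:

    import torch.nn as nn

    def beta_smooth(model, beta=10.0):
        """Recursively replace every ReLU by Softplus(beta); smaller beta gives a
        smoother (lower-curvature) surrogate of the original network."""
        for name, child in model.named_children():
            if isinstance(child, nn.ReLU):
                setattr(model, name, nn.Softplus(beta=beta))
            else:
                beta_smooth(child, beta)
        return model

For example, beta_smooth(torchvision.models.vgg16(pretrained=True), beta=10.0) yields a smoothed copy whose explanations can then be compared across β values.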

Links

NeurIPS paper

arXiv version

Google Drive

Code

Install

Install dependencies using

 pip install -r requirements.txt 

Usage

Manipulate an image to reproduce a given target explanation using

python run_attack.py --cuda

For explanation methods other than LRP you need to enable --beta_growth so that the second derivative of the activations is not zero.

python run_attack.py --cuda --method gradient --beta_growth
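As an illustrative check (not one of the repository's scripts), one can compare the second derivative of ReLU and softplus directly, which is the reason --beta_growth is needed here:

    import torch
    from torch.autograd.functional import hessian

    x = torch.tensor([1.5])
    # ReLU is piecewise linear, so its second derivative vanishes almost everywhere.
    print(hessian(lambda t: torch.relu(t).sum(), x))                     # tensor([[0.]])
    # Softplus keeps a non-zero second derivative, so the attack can optimize through it.
    print(hessian(lambda t: torch.nn.functional.softplus(t).sum(), x))   # ~0.149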

Plot softplus explanations for various values of beta using

python plot_expl.py --cuda 

To download patterns for pattern attribution, please use the following link:

https://drive.google.com/open?id=1RdvAiUZgfhSE8sVF2JOyURpnk1HQ_hZk

Copy the downloaded file into the models subdirectory.

License

This repository is licensed under the Apache License, Version 2.0. See LICENSE for the full license text.