Papers on Explainable Artificial Intelligence

This is an ongoing attempt to consolidate interesting efforts in the area of understanding / interpreting / explaining / visualizing a pre-trained ML model.

GUI tools

DeepVis: Deep Visualization Toolbox. Yosinski et al. ICML 2015 code | pdf
SWAP: Generate adversarial poses of objects in a 3D space. Alcorn et al. CVPR 2019 code | pdf
AllenNLP: Query online NLP models with user-provided inputs and observe explanations (Gradient, Integrated Gradient, SmoothGrad). Last accessed 03/2020 demo

Libraries

CNN visualizations (feature visualization, PyTorch)
iNNvestigate (attribution, Keras)
DeepExplain (attribution, Keras)
Lucid (feature visualization, attribution, TensorFlow)
TorchRay (attribution, PyTorch)
Captum (attribution, PyTorch)
InterpretML (attribution, Python)

Surveys

Methods for Interpreting and Understanding Deep Neural Networks. Montavon et al. 2017 pdf
Visualizations of Deep Neural Networks in Computer Vision: A Survey. Seifert et al. 2017 pdf
How convolutional neural network see the world - A survey of convolutional neural network visualization methods. Qin et al. 2018 pdf
A brief survey of visualization methods for deep learning models from the perspective of Explainable AI. Chalkiadakis 2018 pdf
A Survey Of Methods For Explaining Black Box Models. Guidotti et al. 2018 pdf
Understanding Neural Networks via Feature Visualization: A survey. Nguyen et al. 2019 pdf
Explaining Explanations: An Overview of Interpretability of Machine Learning. Gilpin et al. 2019 pdf
DARPA updates on the XAI program pdf
Explainable Artificial Intelligence: a Systematic Review. Vilone et al. 2020 pdf

Opinions

Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Rudin. Nature Machine Intelligence 2019 pdf

Definitions of Interpretability

The Mythos of Model Interpretability. Lipton 2016 pdf
Towards A Rigorous Science of Interpretable Machine Learning. Doshi-Velez & Kim. 2017 pdf
Interpretable machine learning: definitions, methods, and applications. Murdoch et al. 2019 pdf

Books

Interpretable Machine Learning: A Guide for Making Black Box Models Explainable. Molnar 2019 pdf

A. Explaining inner-workings

A1. Visualizing Preferred Stimuli

Synthesizing images / Activation Maximization

AM: Visualizing higher-layer features of a deep network. Erhan et al. 2009 pdf
Deep inside convolutional networks: Visualising image classification models and saliency maps. Simonyan et al. 2013 pdf
DeepVis: Understanding Neural Networks through Deep Visualization. Yosinski et al. ICML workshop 2015 pdf | url
MFV: Multifaceted Feature Visualization: Uncovering the different types of features learned by each neuron in deep neural networks. Nguyen et al. ICML workshop 2016 pdf | code
DGN-AM: Synthesizing the preferred inputs for neurons in neural networks via deep generator networks. Nguyen et al. NIPS 2016 pdf | code
PPGN: Plug and Play Generative Networks. Nguyen et al. CVPR 2017 pdf | code
Feature Visualization. Olah et al. 2017 url
Diverse feature visualizations reveal invariances in early layers of deep neural networks. Cadena et al. 2018 pdf
Computer Vision with a Single (Robust) Classifier. Santurkar et al. NeurIPS 2019 pdf | blog | code
BigGAN-AM: Improving sample diversity of a pre-trained, class-conditional GAN by changing its class embeddings. Li et al. 2019 pdf

Real images / Segmentation Masks

Visualizing and Understanding Recurrent Networks. Karpathy et al. ICLR 2015 pdf
Object Detectors Emerge in Deep Scene CNNs. Zhou et al. ICLR 2015 pdf
Understanding Deep Architectures by Interpretable Visual Summaries. Godi et al. BMVC 2019 pdf

A2. Inverting Neural Networks

A2.1 Inverting Classifiers

Understanding Deep Image Representations by Inverting Them. Mahendran & Vedaldi. CVPR 2015 pdf
Inverting Visual Representations with Convolutional Networks. Dosovitskiy & Brox. CVPR 2016 pdf
Neural network inversion beyond gradient descent. Wong & Kolter. NIPS workshop 2017 pdf

A2.2 Inverting Generators

Image Processing Using Multi-Code GAN Prior. Gu et al. 2019 pdf

A3. Distilling DNNs into more interpretable models

Interpreting CNNs via Decision Trees pdf
Distilling a Neural Network Into a Soft Decision Tree pdf
Distill-and-Compare: Auditing Black-Box Models Using Transparent Model Distillation. Tan et al. 2018 pdf
Improving the Interpretability of Deep Neural Networks with Knowledge Distillation. Liu et al. 2018 pdf

A4. Quantitatively characterizing hidden features

TCAV: Interpretability Beyond Feature Attribution: Quantitative Testing with Concept Activation Vectors. Kim et al. 2018 pdf | code
- Automating Interpretability: Discovering and Testing Visual Concepts Learned by Neural Networks. Ghorbani et al. 2019 pdf
SVCCA: Singular Vector Canonical Correlation Analysis for Deep Learning Dynamics and Interpretability. Raghu et al. 2017 pdf | code
A Peek Into the Hidden Layers of a Convolutional Neural Network Through a Factorization Lens. Saini et al. 2018 pdf
Network Dissection: Quantifying Interpretability of Deep Visual Representations. Bau et al. CVPR 2017 url | pdf
- GAN Dissection: Visualizing and Understanding Generative Adversarial Networks. Bau et al. ICLR 2019 pdf
- Net2Vec: Quantifying and Explaining how Concepts are Encoded by Filters in Deep Neural Networks. Fong & Vedaldi. CVPR 2018 pdf
- Intriguing generalization and simplicity of adversarially trained neural networks. Agarwal, Chen, Nguyen 2020 pdf
- Understanding the Role of Individual Units in a Deep Neural Network. Bau et al. PNAS 2020 pdf

A5. Network surgery

How Important Is a Neuron? Dhamdhere et al. 2018 pdf

A6. Sensitivity analysis

NLIZE: A Perturbation-Driven Visual Interrogation Tool for Analyzing and Interpreting Natural Language Inference Models. Liu et al. 2018 pdf

B. Decision explanations

B1. Attribution maps

B1.0 Surveys

Feature Removal Is A Unifying Principle For Model Explanation Methods. Covert et al. 2020 pdf

B1.1 White-box / Gradient-based

A Taxonomy and Library for Visualizing Learned Features in Convolutional Neural Networks pdf

Gradient

Deep inside convolutional networks: Visualising image classification models and saliency maps. Simonyan et al. 2013 pdf
Deconvnet: Visualizing and understanding convolutional networks. Zeiler et al. 2014 pdf
Guided-backprop: Striving for simplicity: The all convolutional net. Springenberg et al. 2015 pdf
SmoothGrad: removing noise by adding noise. Smilkov et al. 2017 pdf

Input x Gradient

DeepLIFT: Learning important features through propagating activation differences. Shrikumar et al. 2017 pdf
Integrated Gradients: Axiomatic Attribution for Deep Networks. Sundararajan et al. 2018 pdf | code
- Expected Gradients: Learning Explainable Models Using Attribution Priors. Erion et al. 2019 pdf | code
- I-GOR: Visualizing Deep Networks by Optimizing with Integrated Gradients. Qi et al. 2019 pdf
- BlurIG: Attribution in Scale and Space. Xu et al. CVPR 2020 pdf | code
- XRAI: Better Attributions Through Regions. Kapishnikov et al. ICCV 2019 pdf | code
LRP: Beyond saliency: understanding convolutional neural networks from saliency prediction on layer-wise relevance propagation pdf
- DTD: Explaining NonLinear Classification Decisions With Deep Taylor Decomposition pdf

Activation map

CAM: Learning Deep Features for Discriminative Localization. Zhou et al. 2016 code | web
Grad-CAM: Visual Explanations from Deep Networks via Gradient-based Localization. Selvaraju et al. 2017 pdf
Grad-CAM++: Improved Visual Explanations for Deep Convolutional Networks. Chattopadhyay et al. 2017 pdf | code
Smooth Grad-CAM++: An Enhanced Inference Level Visualization Technique for Deep Convolutional Neural Network Models. Omeiza et al. 2019 pdf
NormGrad: There and Back Again: Revisiting Backpropagation Saliency Methods. Rebuffi et al. CVPR 2020 pdf | code

Learning the heatmap

MP: Interpretable Explanations of Black Boxes by Meaningful Perturbation. Fong et al. 2017 pdf
- MP-G: Removing input features via a generative model to explain their attributions to classifier's decisions. Agarwal et al. 2019 pdf | code
- Understanding Deep Networks via Extremal Perturbations and Smooth Masks. Fong et al. ICCV 2019 pdf | code
FIDO: Explaining image classifiers by counterfactual generation. Chang et al. ICLR 2019 pdf
FG-Vis: Interpretable and Fine-Grained Visual Explanations for Convolutional Neural Networks. Wagner et al. CVPR 2019 pdf

Attributions of network biases

Full-Gradient Representation for Neural Network Visualization. Srinivas et al. NeurIPS 2019 pdf
Bias also matters: Bias attribution for deep neural network explanation. Wang et al. ICML 2019 pdf

Others

Visual explanation by interpretation: Improving visual feedback capabilities of deep neural networks. Oramas et al. 2019 pdf
Regional Multi-scale Approach for Visually Pleasing Explanations of Deep Neural Networks. Seo et al. 2018 pdf

B1.2 Attention as Explanation

Computer Vision

Multimodal explanations: Justifying decisions and pointing to the evidence. Park et al. CVPR 2018 pdf

NLP

Attention is not Explanation. Jain & Wallace. NAACL 2019 pdf
Attention is not not Explanation. Wiegreffe & Pinter. EMNLP 2019 pdf
Learning to Deceive with Attention-Based Explanations. Pruthi et al. ACL 2020 pdf

B1.3 Black-box / Perturbation-based

Sliding-Patch: Visualizing and understanding convolutional networks. Zeiler et al. 2014 pdf
PDA: Visualizing deep neural network decisions: Prediction difference analysis. Zintgraf et al. ICLR 2017 pdf
RISE: Randomized Input Sampling for Explanation of Black-box Models. Petsiuk et al. BMVC 2018 pdf
LIME: "Why Should I Trust You?": Explaining the predictions of any classifier. Ribeiro et al. 2016 pdf | blog
- LIME-G: Removing input features via a generative model to explain their attributions to classifier's decisions. Agarwal et al. 2019 pdf | code
SHAP: A Unified Approach to Interpreting Model Predictions. Lundberg et al. 2017 pdf | code
OSFT: Interpreting Black Box Models via Hypothesis Testing. Burns et al. 2019 pdf

B1.4 Evaluating heatmaps

Computer Vision

The (Un)reliability of saliency methods. Kindermans et al. 2018 pdf
ROAR: A Benchmark for Interpretability Methods in Deep Neural Networks. Hooker et al. NeurIPS 2019 pdf | code
Sanity Checks for Saliency Maps. Adebayo et al. 2018 pdf
A Theoretical Explanation for Perplexing Behaviors of Backpropagation-based Visualizations. Nie et al. 2018 pdf
BIM: Towards Quantitative Evaluation of Interpretability Methods with Ground Truth. Yang et al. 2019 pdf
On the (In)fidelity and Sensitivity for Explanations. Yeh et al. 2019 pdf
SAM: The Sensitivity of Attribution Methods to Hyperparameters. Bansal, Agarwal, Nguyen. CVPR 2020 pdf | code

NLP

Evaluating Explainable AI: Which Algorithmic Explanations Help Users Predict Model Behavior? Hase & Bansal. ACL 2020 pdf | code

B2. Learning to explain

B2.1 Regularizing attribution maps

Right for the Right Reasons: Training Differentiable Models by Constraining their Explanations. Ross et al. IJCAI 2017 pdf
Learning Explainable Models Using Attribution Priors. Erion et al. 2019 pdf
Interpretations are useful: penalizing explanations to align neural networks with prior knowledge. Rieger et al. 2019 pdf

B2.2 Explaining by examples (prototypes)

This Looks Like That: Deep Learning for Interpretable Image Recognition. Chen et al. NeurIPS 2019 pdf | code
- ProtoPNet: This Looks Like That, Because ... Explaining Prototypes for Interpretable Image Recognition. Nauta et al. 2020 pdf
- NP-ProtoPNet: These do not Look Like Those. Singh et al. 2021 pdf

B2.3 Others

Learning how to explain neural networks: PatternNet and PatternAttribution pdf
Deep Learning for Case-Based Reasoning through Prototypes pdf
Unsupervised Learning of Neural Networks to Explain Neural Networks pdf
Automated Rationale Generation: A Technique for Explainable AI and its Effects on Human Perceptions pdf
- Rationalization: A Neural Machine Translation Approach to Generating Natural Language Explanations pdf
Towards robust interpretability with self-explaining neural networks. Alvarez-Melis & Jaakkola. 2018 pdf

C. Counterfactual explanations

Counterfactual Explanations for Machine Learning: A Review. Verma et al. 2020 pdf
Interpreting Neural Network Judgments via Minimal, Stable, and Symbolic Corrections. Zhang et al. 2018 pdf
Counterfactual Visual Explanations. Goyal et al. 2019 pdf
Generative Counterfactual Introspection for Explainable Deep Learning. Liu et al. 2019 pdf

Generative models

Generative causal explanations of black-box classifiers. O’Shaughnessy et al. 2020 pdf
Removing input features via a generative model to explain their attributions to classifier's decisions. Agarwal et al. 2019 pdf | code

D. Others

Explainable Artificial Intelligence via Bayesian Teaching. Yang & Shafto. NIPS 2017 pdf
Explainable AI for Designers: A Human-Centered Perspective on Mixed-Initiative Co-Creation pdf
ICADx: Interpretable computer aided diagnosis of breast masses. Kim et al. 2018 pdf
Neural Network Interpretation via Fine Grained Textual Summarization. Guo et al. 2018 pdf
LS-Tree: Model Interpretation When the Data Are Linguistic. Chen et al. 2019 pdf
