Papers on Explainable Artificial Intelligence

This is an ongoing attempt to consolidate interesting efforts in the area of understanding / interpreting / explaining / visualizing a pre-trained ML model.


GUI tools

  • DeepVis: Deep Visualization Toolbox. Yosinski et al. ICML 2015 code | pdf
  • SWAP: Generate adversarial poses of objects in a 3D space. Alcorn et al. CVPR 2019 code | pdf
  • AllenNLP: Query online NLP models with user-provided inputs and observe explanations (Gradient, Integrated Gradients, SmoothGrad). Last accessed 03/2020 demo
  • 3DB: A framework for analyzing computer vision models with simulated data code

Libraries

Surveys

  • Methods for Interpreting and Understanding Deep Neural Networks. Montavon et al. 2017 pdf
  • Visualizations of Deep Neural Networks in Computer Vision: A Survey. Seifert et al. 2017 pdf
  • How convolutional neural network see the world - A survey of convolutional neural network visualization methods. Qin et al. 2018 pdf
  • A brief survey of visualization methods for deep learning models from the perspective of Explainable AI. Chalkiadakis 2018 pdf
  • A Survey Of Methods For Explaining Black Box Models. Guidotti et al. 2018 pdf
  • Understanding Neural Networks via Feature Visualization: A survey. Nguyen et al. 2019 pdf
  • Explaining Explanations: An Overview of Interpretability of Machine Learning. Gilpin et al. 2019 pdf
  • DARPA updates on the XAI program pdf
  • Explainable Artificial Intelligence: a Systematic Review. Vilone et al. 2020 pdf

Opinions

  • Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Rudin. Nature Machine Intelligence 2019 pdf
  • Towards falsifiable interpretability research. Leavitt & Morcos 2020 pdf
  • Four principles of Explainable Artificial Intelligence. Phillips et al. 2021 (NIST.gov) pdf

Open research questions

  • Interpretable Machine Learning: Fundamental Principles and 10 Grand Challenges. Rudin et al. 2021 pdf

Definitions of Interpretability

  • The Mythos of Model Interpretability. Lipton 2016 pdf
  • Towards A Rigorous Science of Interpretable Machine Learning. Doshi-Velez & Kim. 2017 pdf
  • Interpretable machine learning: definitions, methods, and applications. Murdoch et al. 2019 pdf

Books

  • A Guide for Making Black Box Models Explainable. Molnar 2019 pdf

A. Explaining model inner-workings

A1. Visualizing Preferred Stimuli

Synthesizing images / Activation Maximization

  • AM: Visualizing higher-layer features of a deep network. Erhan et al. 2009 pdf
  • Deep inside convolutional networks: Visualising image classification models and saliency maps. Simonyan et al. 2013 pdf
  • DeepVis: Understanding Neural Networks through Deep Visualization. Yosinski et al. ICML workshop 2015 pdf | url
  • MFV: Multifaceted Feature Visualization: Uncovering the different types of features learned by each neuron in deep neural networks. Nguyen et al. ICML workshop 2016 pdf | code
  • DGN-AM: Synthesizing the preferred inputs for neurons in neural networks via deep generator networks. Nguyen et al. NIPS 2016 pdf | code
  • PPGN: Plug and Play Generative Networks. Nguyen et al. CVPR 2017 pdf | code
  • Feature Visualization. Olah et al. 2017 url
  • Diverse feature visualizations reveal invariances in early layers of deep neural networks. Cadena et al. 2018 pdf
  • Computer Vision with a Single (Robust) Classifier. Santurkar et al. NeurIPS 2019 pdf | blog | code
  • BigGAN-AM: A cost-effective method for improving and re-purposing large, pre-trained GANs by fine-tuning their class-embeddings. Li et al. ACCV 2020 pdf | code

Real images / Segmentation Masks

  • Visualizing and Understanding Recurrent Networks. Karpathy et al. ICLR 2015 pdf
  • Object Detectors Emerge in Deep Scene CNNs. Zhou et al. ICLR 2015 pdf
  • Understanding Deep Architectures by Interpretable Visual Summaries. Godi et al. BMVC 2019 pdf

A2. Inverting Neural Networks

A2.1 Inverting Classifiers

  • Understanding Deep Image Representations by Inverting Them. Mahendran & Vedaldi. CVPR 2015 pdf
  • Inverting Visual Representations with Convolutional Networks. Dosovitskiy & Brox. CVPR 2016 pdf
  • Neural network inversion beyond gradient descent. Wong & Kolter. NIPS workshop 2017 pdf
  • Inverting Adversarially Robust Networks for Image Synthesis. Rojas-Gomez et al. 2021 pdf | code

A2.2 Inverting Generators

  • Image Processing Using Multi-Code GAN Prior. Gu et al. 2019 pdf

A3. Distilling DNNs into more interpretable models

  • Interpreting CNNs via Decision Trees pdf
  • Distilling a Neural Network Into a Soft Decision Tree pdf
  • Distill-and-Compare: Auditing Black-Box Models Using Transparent Model Distillation. Tan et al. 2018 pdf
  • Improving the Interpretability of Deep Neural Networks with Knowledge Distillation. Liu et al. 2018 pdf

A4. Quantitatively characterizing hidden features

  • TCAV: Interpretability Beyond Feature Attribution: Quantitative Testing with Concept Activation Vectors. Kim et al. 2018 pdf | code
    • DTCAV: Automating Interpretability: Discovering and Testing Visual Concepts Learned by Neural Networks. Ghorbani et al. 2019 pdf
  • SVCCA: Singular Vector Canonical Correlation Analysis for Deep Learning Dynamics and Interpretability. Raghu et al. 2017 pdf | code
  • A Peek Into the Hidden Layers of a Convolutional Neural Network Through a Factorization Lens. Saini et al. 2018 pdf
  • Network Dissection: Quantifying Interpretability of Deep Visual Representations. Bau et al. CVPR 2017 url | pdf
    • GAN Dissection: Visualizing and Understanding Generative Adversarial Networks. Bau et al. ICLR 2019 pdf
    • Net2Vec: Quantifying and Explaining how Concepts are Encoded by Filters in Deep Neural Networks. Fong & Vedaldi CVPR 2018 pdf
    • Intriguing generalization and simplicity of adversarially trained neural networks. Chen, Agarwal, Nguyen 2020 pdf
    • Understanding the Role of Individual Units in a Deep Neural Network. Bau et al. PNAS 2020 pdf

A5. Network surgery

  • How Important Is a Neuron? Dhamdhere et al. 2018 pdf

A6. Sensitivity analysis

  • NLIZE: A Perturbation-Driven Visual Interrogation Tool for Analyzing and Interpreting Natural Language Inference Models. Liu et al. 2018 pdf

B. Explaining model decisions

B1. Attribution maps

B1.0 Surveys

  • Feature Removal Is A Unifying Principle For Model Explanation Methods. Covert et al. 2020 pdf

B1.1 White-box / Gradient-based

  • A Taxonomy and Library for Visualizing Learned Features in Convolutional Neural Networks pdf

Gradient

  • Gradient: Deep inside convolutional networks: Visualising image classification models and saliency maps. Simonyan et al. 2013 pdf
  • Deconvnet: Visualizing and understanding convolutional networks. Zeiler et al. 2014 pdf
  • Guided-backprop: Striving for simplicity: The all convolutional net. Springenberg et al. 2015 pdf
  • SmoothGrad: removing noise by adding noise. Smilkov et al. 2017 pdf

Input x Gradient

  • DeepLIFT: Learning important features through propagating activation differences. Shrikumar et al. 2017 pdf
  • IG: Axiomatic Attribution for Deep Networks. Sundararajan et al. 2018 pdf | code
    • EG: Learning Explainable Models Using Attribution Priors. Erion et al. 2019 pdf | code
    • I-GOR: Visualizing Deep Networks by Optimizing with Integrated Gradients. Qi et al. 2019 pdf
    • BlurIG: Attribution in Scale and Space. Xu et al. CVPR 2020 pdf | code
    • XRAI: Better Attributions Through Regions. Kapishnikov et al. ICCV 2019 pdf | code
  • LRP: Beyond saliency: understanding convolutional neural networks from saliency prediction on layer-wise relevance propagation pdf
    • DTD: Explaining NonLinear Classification Decisions With Deep Taylor Decomposition pdf

Activation map

  • CAM: Learning Deep Features for Discriminative Localization. Zhou et al. 2016 code | web
  • Grad-CAM: Visual Explanations from Deep Networks via Gradient-based Localization. Selvaraju et al. 2017 pdf
  • Grad-CAM++: Improved Visual Explanations for Deep Convolutional Networks. Chattopadhyay et al. 2017 pdf | code
  • Smooth Grad-CAM++: An Enhanced Inference Level Visualization Technique for Deep Convolutional Neural Network Models. Omeiza et al. 2019 pdf
  • NormGrad: There and Back Again: Revisiting Backpropagation Saliency Methods. Rebuffi et al. CVPR 2020 pdf | code
  • Score-CAM: Score-Weighted Visual Explanations for Convolutional Neural Networks. Wang et al. CVPR 2020 workshop pdf | code
  • Relevance-CAM: Your Model Already Knows Where to Look. Lee et al. CVPR 2021 pdf | code
  • LIFT-CAM: Towards Better Explanations of Class Activation Mapping. Jung & Oh ICCV 2021 pdf

Learning the heatmap

  • MP: Interpretable Explanations of Black Boxes by Meaningful Perturbation. Fong et al. 2017 pdf
    • MP-G: Removing input features via a generative model to explain their attributions to classifier's decisions. Agarwal & Nguyen ACCV 2020 pdf | code
    • EP: Understanding Deep Networks via Extremal Perturbations and Smooth Masks. Fong et al. ICCV 2019 pdf | code
  • FIDO: Explaining image classifiers by counterfactual generation. Chang et al. ICLR 2019 pdf
  • FG-Vis: Interpretable and Fine-Grained Visual Explanations for Convolutional Neural Networks. Wagner et al. CVPR 2019 pdf
  • CEM: Explanations based on the Missing: Towards Contrastive Explanations with Pertinent Negatives. Dhurandhar & Chen et al. NeurIPS 2018 pdf | code

Attributions of network biases

  • FullGrad: Full-Gradient Representation for Neural Network Visualization. Srinivas et al. NeurIPS 2019 pdf
  • Bias also matters: Bias attribution for deep neural network explanation. Wang et al. ICML 2019 pdf

Others

  • Visual explanation by interpretation: Improving visual feedback capabilities of deep neural networks. Oramas et al. 2019 pdf
  • Regional Multi-scale Approach for Visually Pleasing Explanations of Deep Neural Networks. Seo et al. 2018 pdf

B1.2 Attention as Explanation

Computer Vision

  • Multimodal explanations: Justifying decisions and pointing to the evidence. Park et al. CVPR 2018 pdf
  • IA-RED2: Interpretability-Aware Redundancy Reduction for Vision Transformers. Pan et al. NeurIPS 2021 pdf
  • Transformer Interpretability Beyond Attention Visualization. Chefer et al. CVPR 2021 pdf | code
  • Generic Attention-model Explainability for Interpreting Bi-Modal and Encoder-Decoder Transformers. Chefer et al. ICCV 2021 pdf | code

NLP

  • Attention is not Explanation. Jain & Wallace. NAACL 2019 pdf
  • Attention is not not Explanation. Wiegreffe & Pinter. EMNLP 2019 pdf
  • Learning to Deceive with Attention-Based Explanations. Pruthi et al. ACL 2020 pdf

B1.3 Black-box / Perturbation-based

  • Sliding-Patch: Visualizing and understanding convolutional networks. Zeiler et al. 2014 pdf
  • PDA: Visualizing deep neural network decisions: Prediction difference analysis. Zintgraf et al. ICLR 2017 pdf
  • RISE: Randomized Input Sampling for Explanation of Black-box Models. Petsiuk et al. BMVC 2018 pdf
  • LIME: Why Should I Trust You?: Explaining the predictions of any classifier. Ribeiro et al. 2016 pdf | blog
    • LIME-G: Removing input features via a generative model to explain their attributions to classifier's decisions. Agarwal & Nguyen. ACCV 2020 pdf | code
  • SHAP: A Unified Approach to Interpreting Model Predictions. Lundberg et al. 2017 pdf | code
  • OSFT: Interpreting Black Box Models via Hypothesis Testing. Burns et al. 2019 pdf
  • IM: Interpretation of NLP models through input marginalization. Kim et al. EMNLP 2020 pdf
    • Considering Likelihood in NLP Classification Explanations with Occlusion and Language Modeling. Harbecke et al. 2020 pdf

B1.4 Evaluating feature importance/attribution heatmaps

Metrics

  • Deletion & Insertion: Randomized Input Sampling for Explanation of Black-box Models. Petsiuk et al. BMVC 2018 pdf
  • ROAD: A Consistent and Efficient Evaluation Strategy for Attribution Methods. Rong & Leemann, et al. ICML 2022 pdf | code
  • ROAR: A Benchmark for Interpretability Methods in Deep Neural Networks. Hooker et al. NeurIPS 2019 pdf | code
    • DiffROAR: Do Input Gradients Highlight Discriminative Features? Shah et al. NeurIPS 2021 pdf | code
  • Sanity Checks for Saliency Maps. Adebayo et al. 2018 pdf
  • BIM: Towards Quantitative Evaluation of Attribution Methods with Ground Truth. Yang et al. 2019 pdf
  • SAM: The Sensitivity of Attribution Methods to Hyperparameters. Bansal, Agarwal, Nguyen. CVPR 2020 pdf | code

Evaluating heatmaps on humans

  • The effectiveness of feature attribution methods and its correlation with automatic evaluation scores. Nguyen, Kim, Nguyen 2021 pdf
  • Debugging Tests for Model Explanations. Adebayo et al. NeurIPS 2020 pdf
  • In Search of Verifiability: Explanations Rarely Enable Complementary Performance in AI-Advised Decision Making. Fok & Weld. 2023 pdf

Computer Vision

  • The (Un)reliability of saliency methods. Kindermans et al. 2018 pdf
  • A Theoretical Explanation for Perplexing Behaviors of Backpropagation-based Visualizations. Nie et al. 2018 pdf
  • On the (In)fidelity and Sensitivity for Explanations. Yeh et al. 2019 pdf

NLP

  • Deletion_BERT: Double Trouble: How to not explain a text classifier’s decisions using counterfactuals synthesized by masked language models. Pham et al. 2022 pdf | code
  • Evaluating Explainable AI: Which Algorithmic Explanations Help Users Predict Model Behavior? Hase & Bansal ACL 2020 pdf | code
  • Teach Me to Explain: A Review of Datasets for Explainable NLP. Wiegreffe & Marasović 2021 pdf | web

Tabular data

  • Challenging common interpretability assumptions in feature attribution explanations. Dinu et al. NeurIPS workshop 2020 pdf

Many domains

  • How Can I Explain This to You? An Empirical Study of Deep Neural Network Explanation Methods. Jeyakumar et al. NeurIPS 2020 pdf | code

B1.5 Explaining image-image similarity

  • BiLRP: Building and Interpreting Deep Similarity Models. Eberle et al. TPAMI 2020 pdf
  • SANE: Why do These Match? Explaining the Behavior of Image Similarity Models. Plummer et al. ECCV 2020 pdf
  • Visualizing Deep Similarity Networks. Stylianou et al. WACV 2019 pdf | code
  • Visual Explanation for Deep Metric Learning. Zhu et al. 2019 pdf | code

Face verification

  • DISE: Explainable Face Recognition. Williford et al. ECCV 2020 pdf | code
  • xCos: An explainable cosine metric for face verification task. Lin et al. 2021 pdf | code
  • DeepFace-EMD: Re-ranking Using Patch-wise Earth Mover's Distance Improves Out-Of-Distribution Face Identification. Phan & Nguyen. CVPR 2022 pdf | code

B2. Learning to explain

B2.1 Regularizing attribution maps

  • Right for the Right Reasons: Training Differentiable Models by Constraining their Explanations. Ross et al. IJCAI 2017 pdf
  • Learning Explainable Models Using Attribution Priors. Erion et al. 2019 pdf
  • Interpretations are useful: penalizing explanations to align neural networks with prior knowledge. Rieger et al. 2019 pdf

B2.2 Training deep nets to approximate expensive, posthoc attribution methods

  • L2E: Learning to Explain: Generating Stable Explanations Fast. Situ et al. ACL 2021 pdf | code
  • Efficient Explanations from Empirical Explainers. Schwarzenberg et al. 2021 pdf

B2.3 Explaining by prototypes

  • ProtoPNet: This Looks Like That: Deep Learning for Interpretable Image Recognition. Chen et al. NeurIPS 2019 pdf | code
    • This Looks Like That, Because ... Explaining Prototypes for Interpretable Image Recognition. Nauta et al. 2020 pdf | code
    • NP-ProtoPNet: These do not Look Like Those. Singh et al. 2021 pdf
  • ProtoTree: Neural Prototype Trees for Interpretable Fine-grained Image Recognition. Nauta et al. CVPR 2021 pdf | code

B2.4 Explaining by retrieving supporting examples

  • EMD-Corr & CHM-Corr: Visual correspondence-based explanations improve AI robustness and human-AI team accuracy. Nguyen, Taesiri, Nguyen 2022. pdf | code

B2.5 Adversarial attacks on XAI systems with humans in the loop

  • When and How to Fool Explainable Models (and Humans) with Adversarial Examples. Vadillo et al. 2021 pdf
  • The effectiveness of feature attribution methods and its correlation with automatic evaluation scores. Nguyen, Kim, Nguyen 2021 pdf

B2.6 Others

  • Learning how to explain neural networks: PatternNet and PatternAttribution pdf
  • Deep Learning for Case-Based Reasoning through Prototypes pdf
  • Unsupervised Learning of Neural Networks to Explain Neural Networks pdf
  • Automated Rationale Generation: A Technique for Explainable AI and its Effects on Human Perceptions pdf
    • Rationalization: A Neural Machine Translation Approach to Generating Natural Language Explanations pdf
  • Towards robust interpretability with self-explaining neural networks. Alvarez-Melis and Jaakkola 2018 pdf

C. Counterfactual explanations

  • Counterfactual Explanations for Machine Learning: A Review. Verma et al. 2020 pdf
  • Interpreting Neural Network Judgments via Minimal, Stable, and Symbolic Corrections. Zhang et al. 2018 pdf
  • Counterfactual Visual Explanations. Goyal et al. 2019 pdf
  • Generative Counterfactual Introspection for Explainable Deep Learning. Liu et al. 2019 pdf

Generative models

  • Generative causal explanations of black-box classifiers. O’Shaughnessy et al. 2020 pdf
  • Removing input features via a generative model to explain their attributions to classifier's decisions. Agarwal et al. 2019 pdf | code

D. Explainable AI in the real world

Medical domains

  • A systematic review on the use of explainability in deep learning systems for computer aided diagnosis in radiology: Limited use of explainable AI? Groen et al. European Journal of Radiology 2022 pdf
  • “Help Me Help the AI”: Understanding How Explainability Can Support Human-AI Interaction. Kim et al. 2022 [pdf](https://arxiv.org/abs/2210.03735 "Practical recommendations and feedback for human-AI explanation designs from interviews with 20 end-users of Merlin, a bird-identification app.")

E. Human-AI collaboration

Computer vision

  • Human-AI Collaboration: The Effect of AI Delegation on Human Task Performance and Task Satisfaction. Hemmer et al. IUI 2023 [pdf](https://arxiv.org/abs/2303.09224 "Letting AIs handle most images in image classification and leaving the harder ones to humans result in higher overall classification accuracy than humans alone.")

F. Others

  • Explainable Artificial Intelligence via Bayesian Teaching. Yang & Shafto. NIPS 2017 pdf
  • Explainable AI for Designers: A Human-Centered Perspective on Mixed-Initiative Co-Creation pdf
  • ICADx: Interpretable computer aided diagnosis of breast masses. Kim et al. 2018 pdf
  • Neural Network Interpretation via Fine Grained Textual Summarization. Guo et al. 2018 pdf
  • LS-Tree: Model Interpretation When the Data Are Linguistic. Chen et al. 2019 pdf