There have been many approaches to achieving interpretability in DL; however, only a few research directions look promising in 2021. Therefore, I only consider NEW and likely PROMISING directions here.
Below is the layout of this hub:
- Network conceptualization
- Prototype-based explanations
- Inherently-interpretable DNNs
- Evaluation of explanations on down-stream tasks
- Interpreting Large Foundation Models (LLMs)
- Interactive XAI
Other categorizations are reasonable as well (e.g. from Anh Nguyen, Molnar, or lopusz); however, I'd like to curate my own layout.
I also like this distinction (at 1:45) between Explainable ML and Interpretable ML by Cynthia Rudin.
This line of research assigns human concepts to the concepts learned by DNNs, which can make explanations more human-friendly and specific. Here I have only picked a few representative papers; please contribute if any are missing. A minimal code sketch of the concept-probing idea follows the list below.
- Concept Bottleneck Models https://proceedings.mlr.press/v119/koh20a.html
- Codebook Features: Sparse and Discrete Interpretability for Neural Networks https://arxiv.org/pdf/2310.17230.pdf
- Backpack Language Models https://arxiv.org/abs/2305.16765
- Network Dissection: Quantifying Interpretability of Deep Visual Representations (CVPR2017) - review
- Interpretability Beyond Feature Attribution: Quantitative Testing with Concept Activation Vectors (TCAV) (ICML2018) - review
- Towards Automatic Concept-based Explanations (NeurIPS2019) - review
- MILAN - Natural Language Descriptions of Deep Visual Features (ICLR2022) - paper
- LAVISE - Explaining Deep Convolutional Neural Networks via Unsupervised Visual-Semantic Filter Attention (CVPR2022) - paper
- DISSECT: Disentangled Simultaneous Explanations via Concept Traversals (ICLR2022) - paper
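To make the concept-probing idea concrete, below is a minimal, hypothetical sketch in the spirit of TCAV: a linear probe is trained to separate layer activations of concept examples from random examples, its weight vector serves as the concept direction (CAV), and the fraction of positive directional derivatives gives a concept-sensitivity score. The activation and gradient arrays are assumed inputs for illustration, not part of any specific library API.

```python
# Minimal CAV sketch in the spirit of TCAV (Kim et al., ICML 2018).
# Assumptions: `acts_concept` and `acts_random` are (N, D) arrays of
# activations from one layer of a trained network, collected on images
# of a human concept (e.g. "striped") and on random images.
import numpy as np
from sklearn.linear_model import LogisticRegression

def compute_cav(acts_concept: np.ndarray, acts_random: np.ndarray) -> np.ndarray:
    """Fit a linear probe; its (normalized) weight vector is the concept direction."""
    X = np.concatenate([acts_concept, acts_random])
    y = np.concatenate([np.ones(len(acts_concept)), np.zeros(len(acts_random))])
    probe = LogisticRegression(max_iter=1000).fit(X, y)
    cav = probe.coef_[0]
    return cav / np.linalg.norm(cav)

def tcav_score(grads: np.ndarray, cav: np.ndarray) -> float:
    """Fraction of inputs whose class-logit gradient (w.r.t. the same layer)
    points in the concept direction, i.e. how sensitive the class is to the concept."""
    return float(np.mean(grads @ cav > 0))
```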
The following works try to mine learned concepts from pretrained models (a rough sketch of this factorization idea follows the list):
- CRAFT: Concept Recursive Activation Factorization for Explainability
- A Holistic Approach to Unifying Automatic Concept Extraction and Concept Importance Estimation
- COCKATIEL: COntinuous Concept ranKed ATtribution with Interpretable ELements for explaining neural net classifiers on NLP tasks
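As a rough illustration of unsupervised concept mining (loosely in the spirit of ACE/CRAFT, not a faithful reimplementation of either), one can factorize a matrix of non-negative activations into a small dictionary of concept directions and inspect the patches that activate each direction most strongly. The `acts` matrix below is an assumed placeholder.

```python
# Rough sketch of unsupervised concept mining by factorizing activations.
# `acts` is an assumed (num_patches, channels) matrix of non-negative
# activations (e.g. post-ReLU features of image crops).
import numpy as np
from sklearn.decomposition import NMF

def mine_concepts(acts: np.ndarray, n_concepts: int = 10):
    """Factorize acts ~= U @ W, where rows of W are candidate concept
    directions and U holds per-patch concept coefficients."""
    model = NMF(n_components=n_concepts, init="nndsvda", max_iter=500)
    U = model.fit_transform(acts)   # (num_patches, n_concepts)
    W = model.components_           # (n_concepts, channels)
    # The patches with the largest coefficient for each concept can be
    # shown to a human to interpret what that concept represents.
    top_patches = np.argsort(-U, axis=0)[:5].T   # (n_concepts, 5) patch indices
    return W, top_patches
```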
This line of research explains DNNs' decisions using prototypes (or examples); hence, it is inherently difficult to evaluate these approaches quantitatively. A simplified sketch of a prototype-based classifier follows the list below.
- This Looks Like That: Deep Learning for Interpretable Image Recognition (NIPS2019) - review
- This Looks Like It Rather Than That: ProtoKNN For Similarity-Based Classifiers https://openreview.net/forum?id=lh-HRYxuoRr
- Neural Prototype Trees for Interpretable Fine-grained Image Recognition https://arxiv.org/abs/2012.02046
- Explaining Latent Representations with a Corpus of Examples (NeurIPS2021)
- A Flexible Nadaraya-Watson Head Can Offer Explainable and Calibrated Classification (Trans. Mach. Learn. Res. 2022)
- Visual correspondence-based explanations improve AI robustness and human-AI team accuracy https://arxiv.org/abs/2208.00780
- AdvisingNets: Learning to Distinguish Correct and Wrong Classifications via Nearest-Neighbor Explanations https://arxiv.org/pdf/2308.13651.pdf
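Below is a heavily simplified sketch of the prototype idea (in the style of ProtoPNet, but not its actual architecture): class scores are a linear function of the input's similarity to learned prototype vectors, so the similarities used for the prediction double as the explanation. The embedding dimension and prototype count are illustrative assumptions.

```python
# Heavily simplified prototype classifier: the class score is a linear
# combination of similarities to learned prototypes, so "this looks like
# that" explanations fall out of the prediction directly.
import torch
import torch.nn as nn

class ProtoClassifier(nn.Module):
    def __init__(self, embed_dim: int = 128, n_prototypes: int = 20, n_classes: int = 10):
        super().__init__()
        self.prototypes = nn.Parameter(torch.randn(n_prototypes, embed_dim))
        self.class_weights = nn.Linear(n_prototypes, n_classes, bias=False)

    def forward(self, z: torch.Tensor):
        """z: (batch, embed_dim) embeddings from any frozen backbone."""
        # Similarity of each embedding to each prototype (negative squared distance).
        sim = -torch.cdist(z, self.prototypes) ** 2   # (batch, n_prototypes)
        logits = self.class_weights(sim)              # (batch, n_classes)
        return logits, sim                            # sim doubles as the explanation
```

In the actual methods, each prototype is tied to a training patch, so the largest entries of `sim` can be visualized as "this part of the input looks like that training example".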
This line of research turns existing black-box DNNs (e.g. VGG or ResNet) into white-box models by altering their architectures or training so that they behave in a human-understandable manner. A minimal sketch of one such architecture follows the list below.
- Seeing is Believing: Brain-Inspired Modular Training for Mechanistic Interpretability
- Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead (Nature Machine Intelligence) - review
- Concept Whitening for Interpretable Image Recognition (Nature Machine Intelligence) - review
- Exploring the cloud of variable importance for the set of all good models (Nature Machine Intelligence) - review
- This Looks Like That: Deep Learning for Interpretable Image Recognition (NeurIPS2019) - review
- Visual correspondence-based explanations improve AI robustness and human-AI team accuracy (NeurIPS2022) - paper
- B-cos Networks: Alignment is All We Need for Interpretability (CVPR2022) - paper
- Neural Additive Models: Interpretable Machine Learning with Neural Nets (NeurIPS2021) - paper
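As one concrete example, here is a minimal sketch of the Neural Additive Model idea: each input feature gets its own small subnetwork and the prediction is the sum of their outputs, so per-feature contributions can be read off directly. This is a bare-bones illustration, not the paper's reference implementation; the sizes are assumptions.

```python
# Minimal sketch of a Neural Additive Model: one small MLP per input
# feature, summed into the prediction, so each feature's contribution
# is directly inspectable.
import torch
import torch.nn as nn

class NAM(nn.Module):
    def __init__(self, n_features: int, hidden: int = 32):
        super().__init__()
        self.feature_nets = nn.ModuleList([
            nn.Sequential(nn.Linear(1, hidden), nn.ReLU(), nn.Linear(hidden, 1))
            for _ in range(n_features)
        ])
        self.bias = nn.Parameter(torch.zeros(1))

    def forward(self, x: torch.Tensor):
        """x: (batch, n_features). Returns the logit and per-feature contributions."""
        contribs = torch.cat(
            [net(x[:, i:i + 1]) for i, net in enumerate(self.feature_nets)], dim=1
        )                                           # (batch, n_features)
        return contribs.sum(dim=1) + self.bias, contribs
```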
As humans are the target end-users of explanations, this line of research investigates how effective explanations actually are for humans in various decision-making tasks.
- Selective Explanations: Leveraging Human Input to Align Explainable AI
- Visual correspondence-based explanations improve AI robustness and human-AI team accuracy (NeurIPS2022) - paper
- The effectiveness of feature attribution methods and its correlation with automatic evaluation scores (NeurIPS2021) - review
- How Well do Feature Visualizations Support Causal Understanding of CNN Activations? (NeurIPS2021) - review
- Explainable AI for Natural Adversarial Images (ICLR2021) - review
- Evaluation of Saliency-based Explainability Methods (ICMLW2021) - review
- Crowdsourcing Evaluation of Saliency-based XAI Methods (PKDD2021)
- Quality Metrics for Transparent Machine Learning With and Without Humans In the Loop Are Not Correlated (ICMLW2021) - review
- How Can I Explain This to You? An Empirical Study of Deep Neural Network Explanation Methods (NeurIPS2020) - review
- Debugging Tests for Model Explanations (NeurIPS2020) - review
- What I Cannot Predict, I Do Not Understand: A Human-Centered Evaluation Framework for Explainability Methods (NeurIPS2022) - review
- HIVE: Evaluating the Human Interpretability of Visual Explanations (ECCV2022) - review
- Post hoc Explanations may be Ineffective for Detecting Unknown Spurious Correlation (ICLR2022) - review
- Graphical Perception of Saliency-based Model Explanations, https://dl.acm.org/doi/pdf/10.1145/3544548.3581320
- Humans, AI, and Context: Understanding End-Users’ Trust in a Real-World Computer Vision Application, https://dl.acm.org/doi/pdf/10.1145/3593013.3593978
- A user interface to communicate interpretable AI decisions to radiologists, https://spie.org/Publications/Proceedings/Paper/10.1117/12.2654068?SSO=1
- "Help Me Help the AI": Understanding How Explainability Can Support Human-AI Interaction, https://dl.acm.org/doi/abs/10.1145/3544548.3581001
- Interpretable deep learning models for better clinician-AI communication in clinical mammography, https://www.spiedigitallibrary.org/conference-proceedings-of-spie/12035/1203507/Interpretable-deep-learning-models-for-better-clinician-AI-communication-in/10.1117/12.2612372.full
- The XAI Alignment Problem: Rethinking How Should We Evaluate Human-Centered AI Explainability Techniques, https://arxiv.org/abs/2303.17707
- AdvisingNets: Learning to Distinguish Correct and Wrong Classifications via Nearest-Neighbor Explanations https://arxiv.org/pdf/2308.13651.pdf
The following works interpret large foundation models (LLMs); a generic sketch of the dictionary-learning idea follows the list.
- Rethinking Interpretability in the Era of Large Language Models, https://arxiv.org/html/2402.01761v1
- Towards Monosemanticity: Decomposing Language Models With Dictionary Learning https://transformer-circuits.pub/2023/monosemantic-features
- Toy Models of Superposition https://transformer-circuits.pub/2022/toy_model/index.html
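As a rough, generic sketch of the dictionary-learning approach (not Anthropic's actual implementation), a sparse autoencoder is trained to reconstruct language-model activations through an overcomplete, L1-penalized hidden layer; the resulting features tend to be more monosemantic than raw neurons. The activation shape and hyperparameters below are assumptions.

```python
# Generic sparse-autoencoder sketch for dictionary learning on LLM
# activations. `acts` would be residual-stream or MLP activations of
# shape (batch, d_model) collected from a language model.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_dict: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_dict)   # overcomplete: d_dict >> d_model
        self.decoder = nn.Linear(d_dict, d_model)

    def forward(self, acts: torch.Tensor):
        features = torch.relu(self.encoder(acts))   # sparse, hopefully monosemantic features
        recon = self.decoder(features)
        return recon, features

def sae_loss(acts, recon, features, l1_coef: float = 1e-3):
    # Reconstruction error plus an L1 penalty that encourages sparse features.
    return ((recon - acts) ** 2).mean() + l1_coef * features.abs().mean()
```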
Finally, the following works study interactive XAI, where users query, converse with, or guide models and their explanations:
- Explaining decision-making algorithms through UI: Strategies to help non-expert stakeholders
- An Interactive UI to Support Sensemaking over Collections of Parallel Texts
- Rethinking Explainability as a Dialogue: A Practitioner's Perspective
- May I Ask a Follow-up Question? Understanding the Benefits of Conversations in Neural Network Explainability
- Explaining machine learning models with interactive natural language conversations using TalkToModel
- Allowing humans to interactively guide machines where to look does not always improve a human-AI team's classification accuracy