Awesome LLM Interpretability

A curated list of amazingly awesome tools, papers, articles, and communities focused on Large Language Model (LLM) Interpretability.

Awesome LLM Interpretability
- Tools
- Papers
- Articles
- Groups

LLM Interpretability Tools

Tools and libraries for LLM interpretability and analysis.

The Learning Interpretability Tool - An open-source platform for visualizing and understanding ML models. Supports classification, regression, and generative models (text & image data); includes saliency methods, attention attribution, counterfactuals, TCAV, embedding visualizations, and Facets-style data analysis.
Comgra - Comgra helps you analyze and debug neural networks in PyTorch.
Pythia - Interpretability analysis to understand how knowledge develops and evolves during training in autoregressive transformers.
Phoenix - AI Observability & Evaluation - Evaluate, troubleshoot, and fine-tune your LLM, CV, and NLP models in a notebook.
Automated Interpretability - Code for automatically generating, simulating, and scoring explanations of neuron behavior.
Fmr.ai - AI interpretability and explainability platform.
Attention Analysis - Analyzing attention maps from BERT transformer.
SpellGPT - Explores GPT-3's ability to spell its own token strings.
SuperICL - Super In-Context Learning code which allows black-box LLMs to work with locally fine-tuned smaller models.
Git Re-Basin - Code release for "Git Re-Basin: Merging Models modulo Permutation Symmetries."
Functionary - Chat language model that can interpret and execute functions/plugins.
Sparse Autoencoder - Sparse Autoencoder for Mechanistic Interpretability.
Rome - Locating and editing factual associations in GPT.
Inseq - Interpretability for sequence generation models.
Neuron Viewer - Tool for viewing neuron activations and explanations.
LLM Visualization - Visualizing LLMs at a low level.
Vanna - Abstractions to use RAG to generate SQL with any LLM.
Copy Suppression - Designed to help explore different prompts for GPT-2 Small, as part of a research project regarding copy-suppression in LLMs.
TransformerViz - Interactive tool to visualize a transformer model through its latent space.
TransformerLens - A Library for Mechanistic Interpretability of Generative Language Models.

LLM Interpretability Papers

Academic and industry papers on LLM interpretability.

Interpretability Illusions in the Generalization of Simplified Models - Shows how interpretability methods based on simplified models (e.g. linear probes) can be prone to generalization illusions.
Self-Influence Guided Data Reweighting for Language Model Pre-training - An application of training data attribution methods to re-weight training data and improve performance.
Data Similarity is Not Enough to Explain Language Model Performance - Discusses the limits of using embedding models to explain effective data selection.
Post Hoc Explanations of Language Models Can Improve Language Models - Evaluates whether explanations generated by language models can also improve model quality.
Does Localization Inform Editing? Surprising Differences in Causality-Based Localization vs. Knowledge Editing in Language Models, Tweet Summary (NeurIPS 2023 Spotlight) - Highlights the limits of Causal Tracing: how a fact is stored in an LLM can be changed by editing weights in a different location than where Causal Tracing suggests.
Finding Neurons in a Haystack: Case Studies with Sparse Probing - Explores the representation of high-level human-interpretable features within neuron activations of large language models (LLMs).
Copy Suppression: Comprehensively Understanding an Attention Head - Investigates a specific attention head in GPT-2 Small, revealing its primary role in copy suppression.
Linear Representations of Sentiment in Large Language Models - Shows how sentiment is represented in Large Language Models (LLMs), finding that sentiment is linearly represented in these models.
Emergent world representations: Exploring a sequence model trained on a synthetic task - Explores emergent internal representations in a GPT variant trained to predict legal moves in the board game Othello.
Towards Automated Circuit Discovery for Mechanistic Interpretability - Introduces the Automatic Circuit Discovery (ACDC) algorithm for identifying important units in neural networks.
A Toy Model of Universality: Reverse Engineering How Networks Learn Group Operations - Examines small neural networks to understand how they learn group compositions, using representation theory.
Causal Mediation Analysis for Interpreting Neural NLP: The Case of Gender Bias - Causal mediation analysis as a method for interpreting neural models in natural language processing.
The Quantization Model of Neural Scaling - Proposes the Quantization Model for explaining neural scaling laws in neural networks.
Discovering Latent Knowledge in Language Models Without Supervision - Presents a method for extracting accurate answers to yes-no questions from language models' internal activations without supervision.
How does GPT-2 compute greater-than?: Interpreting mathematical abilities in a pre-trained language model - Analyzes mathematical capabilities of GPT-2 Small, focusing on its ability to perform the 'greater-than' operation.
Towards Monosemanticity: Decomposing Language Models With Dictionary Learning - Using a sparse autoencoder to decompose the activations of a one-layer transformer into interpretable, monosemantic features.
Language models can explain neurons in language models - Explores how language models like GPT-4 can be used to explain the functioning of neurons within similar models.
Emergent Linear Representations in World Models of Self-Supervised Sequence Models - Linear representations in a world model of Othello-playing sequence models.
"Toward a Mechanistic Understanding of Stepwise Inference in Transformers: A Synthetic Graph Navigation Model" - Explores stepwise inference in autoregressive language models using a synthetic task based on navigating directed acyclic graphs.
"Successor Heads: Recurring, Interpretable Attention Heads In The Wild" - Introduces 'successor heads,' attention heads in LLMs that increment tokens with a natural ordering, such as numbers and days.
"Large Language Models Are Not Robust Multiple Choice Selectors" - Analyzes the bias and robustness of LLMs in multiple-choice questions, revealing their vulnerability to option position changes due to inherent "selection bias".
"Going Beyond Neural Network Feature Similarity: The Network Feature Complexity and Its Interpretation Using Category Theory" - Presents a novel approach to understanding neural networks by examining feature complexity through category theory.
"Let's Verify Step by Step" - Focuses on improving the reliability of LLMs in multi-step reasoning tasks using step-level human feedback.
"Interpretability Illusions in the Generalization of Simplified Models" - Examines the limitations of simplified representations (like SVD) used to interpret deep learning systems, especially in out-of-distribution scenarios.
"The Devil is in the Neurons: Interpreting and Mitigating Social Biases in Language Models" - Presents a novel approach for identifying and mitigating social biases in language models, introducing the concept of 'Social Bias Neurons'.
"Interpreting the Inner Mechanisms of Large Language Models in Mathematical Addition" - Investigates how LLMs perform the task of mathematical addition.
"Measuring Feature Sparsity in Language Models" - Develops metrics to evaluate the success of sparse coding techniques in language model activations.
Toy Models of Superposition - Investigates how models represent more features than dimensions, especially when features are sparse.
Spine: Sparse interpretable neural embeddings - Presents SPINE, a method transforming dense word embeddings into sparse, interpretable ones using denoising autoencoders.
Transformer visualization via dictionary learning: contextualized embedding as a linear superposition of transformer factors - Introduces a novel method for visualizing transformer networks using dictionary learning.
Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling - Introduces Pythia, a toolset designed for analyzing the training and scaling behaviors of LLMs.
On Interpretability and Feature Representations: An Analysis of the Sentiment Neuron - Critically examines the effectiveness of the "Sentiment Neuron".
Engineering monosemanticity in toy models - Explores engineering monosemanticity in neural networks, where individual neurons correspond to distinct features.
Polysemanticity and capacity in neural networks - Investigates polysemanticity in neural networks, where individual neurons represent multiple features.
An Overview of Early Vision in InceptionV1 - A comprehensive exploration of the initial five layers of the InceptionV1 neural network, focusing on early vision.
Visualizing and measuring the geometry of BERT - Delves into BERT's internal representation of linguistic information, focusing on both syntactic and semantic aspects.
Neurons in Large Language Models: Dead, N-gram, Positional - An analysis of neurons in large language models, focusing on the OPT family.
Can Large Language Models Explain Themselves? - Evaluates the effectiveness of self-explanations generated by LLMs in sentiment analysis tasks.
Interpretability in the Wild: GPT-2 small (arXiv) - Provides a mechanistic explanation of how GPT-2 small performs indirect object identification (IOI) in natural language processing.
Sparse Autoencoders Find Highly Interpretable Features in Language Models - Explores the use of sparse autoencoders to extract more interpretable and less polysemantic features from LLMs.
Emergent and Predictable Memorization in Large Language Models - Investigates whether memorization of specific training sequences by large models can be predicted from lower-compute runs, such as smaller models or earlier checkpoints.
Transformers are uninterpretable with myopic methods: a case study with bounded Dyck grammars - Demonstrates that focusing only on specific parts like attention heads or weight matrices in Transformers can lead to misleading interpretability claims.
The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets - This paper investigates the representation of truth in Large Language Models (LLMs) using true/false datasets.
Interpretability at Scale: Identifying Causal Mechanisms in Alpaca - This study presents Boundless Distributed Alignment Search (Boundless DAS), an advanced method for interpreting LLMs like Alpaca.
Representation Engineering: A Top-Down Approach to AI Transparency - Introduces Representation Engineering (RepE), a novel approach for enhancing AI transparency, focusing on high-level representations rather than neurons or circuits.
Explaining black box text modules in natural language with language models - Natural language explanations for LLM attention heads, evaluated using synthetic text.
N2G: A Scalable Approach for Quantifying Interpretable Neuron Representations in Large Language Models - Explains each LLM neuron as a graph.
Augmenting Interpretable Models with LLMs during Training - Uses LLMs to build interpretable classifiers of text data.
ChainPoll: A High Efficacy Method for LLM Hallucination Detection - ChainPoll, a novel hallucination detection methodology that substantially outperforms existing alternatives, and RealHall, a carefully curated suite of benchmark datasets for evaluating hallucination detection metrics proposed in recent literature.

LLM Interpretability Articles

Insightful articles and blog posts on LLM interpretability.

Do Machine Learning Models Memorize or Generalize? - An interactive visualization exploring the phenomenon known as grokking (VISxAI hall of fame).
What Have Language Models Learned? - An interactive visualization to understand how large language models work and the nature of their biases (VISxAI hall of fame).
A New Approach to Computation Reimagines Artificial Intelligence - Discusses hyperdimensional computing, a novel method involving hyperdimensional vectors (hypervectors) for more efficient, transparent, and robust artificial intelligence.
Interpreting GPT: the logit lens - Explores how the logit lens reveals a gradual convergence of GPT's probabilistic predictions across its layers, from initial nonsensical or shallow guesses to more refined predictions.
A Mechanistic Interpretability Analysis of Grokking - Explores the phenomenon of 'grokking' in deep learning, where models suddenly shift from memorization to generalization during training.
200 Concrete Open Problems in Mechanistic Interpretability - Series of posts discussing open research problems in the field of Mechanistic Interpretability (MI), which focuses on reverse-engineering neural networks.
Evaluating LLMs is a minefield - Challenges in assessing the performance and biases of large language models (LLMs) like GPT.
Attribution Patching: Activation Patching At Industrial Scale - Method that uses gradients for a linear approximation of activation patching in neural networks.
Causal Scrubbing: a method for rigorously testing interpretability hypotheses [Redwood Research] - Introduces causal scrubbing, a method for evaluating the quality of mechanistic interpretations in neural networks.
A circuit for Python docstrings in a 4-layer attention-only transformer - Examines a specific neural circuit within a 4-layer transformer model responsible for generating Python docstrings.
Discovering Latent Knowledge in Language Models Without Supervision - Presents a method for extracting accurate answers to yes-no questions from language models' internal activations without supervision.
Toward Transparent AI: A Survey on Interpreting the Inner Structures of Deep Neural Networks - Survey on mechanistic interpretability
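
As a rough illustration of the idea behind the Attribution Patching entry above (a gradient-based linear approximation of activation patching), here is a minimal sketch on a toy function. Everything in it is hypothetical: `metric`, `metric_grad`, and the 2-d "activation" stand in for a real model and are not taken from any listed article or library. It also assumes the patch goes from a clean run to a corrupted one; the opposite direction works the same way.

```python
# Toy sketch of attribution patching vs. activation patching.
# All names are hypothetical; metric() stands in for "run the rest of
# the network from this activation and read off a scalar".

def metric(a):
    """Toy downstream computation on a 2-d activation."""
    x, y = a
    return x * x + 3.0 * y

def metric_grad(a):
    """Analytic gradient of metric with respect to the activation."""
    x, _ = a
    return (2.0 * x, 3.0)

def activation_patching_effect(clean, corrupted):
    """Exact effect: rerun with the corrupted activation patched in."""
    return metric(corrupted) - metric(clean)

def attribution_patching_effect(clean, corrupted):
    """First-order estimate: gradient at the clean activation,
    dotted with the difference between the two activations."""
    grad = metric_grad(clean)
    return sum(g * (c - k) for g, c, k in zip(grad, corrupted, clean))

clean = (1.0, 2.0)
corrupted = (1.1, 2.5)
exact = activation_patching_effect(clean, corrupted)
approx = attribution_patching_effect(clean, corrupted)
print(f"exact={exact:.4f} approx={approx:.4f}")
```

The two numbers differ only by the second-order term of the toy function. In a real model the exact version costs one forward pass per patched activation, while the approximation reuses the gradients from a single backward pass, which is the trade-off the article discusses.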

LLM Interpretability Groups

Communities and groups dedicated to LLM interpretability.

PAIR - A group at Google that works on open-source tools, interactive explorable visualizations, and interpretability research methods.
Alignment Lab AI - Group of researchers focusing on AI alignment.
Nous Research - Research group discussing various topics on interpretability.
EleutherAI - Non-profit AI research lab that focuses on interpretability and alignment of large models.

Contributing and Collaborating

Please see CONTRIBUTING and CODE-OF-CONDUCT for details.

AAAEEEE/awesome-llm-interpretability