Awesome LLM Interpretability

A curated list of amazingly awesome tools, papers, articles, and communities focused on Large Language Model (LLM) Interpretability.

LLM Interpretability Tools

Tools and libraries for LLM interpretability and analysis.

Comgra - Comgra helps you analyze and debug neural networks in PyTorch.
Pythia - Interpretability analysis to understand how knowledge develops and evolves during training in autoregressive transformers.
Phoenix - AI Observability & Evaluation - Evaluate, troubleshoot, and fine-tune your LLM, CV, and NLP models in a notebook.
Automated Interpretability - Code for automatically generating, simulating, and scoring explanations of neuron behavior.
Fmr.ai - AI interpretability and explainability platform.
Attention Analysis - Analyzing attention maps from the BERT transformer.
SpellGPT - Explores GPT-3’s ability to spell its own token strings.
SuperICL - Super In-Context Learning code which allows black-box LLMs to work with locally fine-tuned smaller models.
Git Re-Basin - Code release for "Git Re-Basin: Merging Models modulo Permutation Symmetries."
Functionary - Chat language model that can interpret and execute functions/plugins.
Sparse Autoencoder - Sparse Autoencoder for Mechanistic Interpretability.
Rome - Locating and editing factual associations in GPT.
Inseq - Interpretability for sequence generation models.
Neuron Viewer - Tool for viewing neuron activations and explanations.
LLM Visualization - Visualizing LLMs at a low level.

LLM Interpretability Papers

Academic and industry papers on LLM interpretability.

Finding Neurons in a Haystack: Case Studies with Sparse Probing - Explores the representation of high-level human-interpretable features within neuron activations of large language models (LLMs).
Copy Suppression: Comprehensively Understanding an Attention Head - Investigates a specific attention head in GPT-2 Small, revealing its primary role in copy suppression.
Linear Representations of Sentiment in Large Language Models - Shows how sentiment is represented in Large Language Models (LLMs), finding that sentiment is linearly represented in these models.
Emergent world representations: Exploring a sequence model trained on a synthetic task - Explores emergent internal representations in a GPT variant trained to predict legal moves in the board game Othello.
Towards Automated Circuit Discovery for Mechanistic Interpretability - Introduces the Automatic Circuit Discovery (ACDC) algorithm for identifying important units in neural networks.
A Toy Model of Universality: Reverse Engineering How Networks Learn Group Operations - Examines small neural networks to understand how they learn group compositions, using representation theory.
Causal Mediation Analysis for Interpreting Neural NLP: The Case of Gender Bias - Causal mediation analysis as a method for interpreting neural models in natural language processing.
The Quantization Model of Neural Scaling - Proposes the Quantization Model for explaining neural scaling laws in neural networks.
Discovering Latent Knowledge in Language Models Without Supervision - Presents a method for extracting accurate answers to yes-no questions from language models' internal activations without supervision.
How does GPT-2 compute greater-than?: Interpreting mathematical abilities in a pre-trained language model - Analyzes mathematical capabilities of GPT-2 Small, focusing on its ability to perform the 'greater-than' operation.
Towards Monosemanticity: Decomposing Language Models With Dictionary Learning - Using a sparse autoencoder to decompose the activations of a one-layer transformer into interpretable, monosemantic features.
Language models can explain neurons in language models - Explores how language models like GPT-4 can be used to explain the functioning of neurons within similar models.
Emergent Linear Representations in World Models of Self-Supervised Sequence Models - Linear representations in a world model of Othello-playing sequence models.
Toward a Mechanistic Understanding of Stepwise Inference in Transformers: A Synthetic Graph Navigation Model - Explores stepwise inference in autoregressive language models using a synthetic task based on navigating directed acyclic graphs.
Successor Heads: Recurring, Interpretable Attention Heads In The Wild - Introduces 'successor heads,' attention heads that increment tokens with a natural ordering, such as numbers and days, in LLMs.
Large Language Models Are Not Robust Multiple Choice Selectors - Analyzes the bias and robustness of LLMs in multiple-choice questions, revealing their vulnerability to option position changes due to inherent "selection bias".
Going Beyond Neural Network Feature Similarity: The Network Feature Complexity and Its Interpretation Using Category Theory - Presents a novel approach to understanding neural networks by examining feature complexity through category theory.
Let's Verify Step by Step - Focuses on improving the reliability of LLMs in multi-step reasoning tasks using step-level human feedback.
Interpretability Illusions in the Generalization of Simplified Models - Examines the limitations of simplified representations (like SVD) used to interpret deep learning systems, especially in out-of-distribution scenarios.
The Devil is in the Neurons: Interpreting and Mitigating Social Biases in Language Models - Presents a novel approach for identifying and mitigating social biases in language models, introducing the concept of 'Social Bias Neurons'.
Interpreting the Inner Mechanisms of Large Language Models in Mathematical Addition - Investigates how LLMs perform the task of mathematical addition.
Measuring Feature Sparsity in Language Models - Develops metrics to evaluate the success of sparse coding techniques in language model activations.
Toy Models of Superposition - Investigates how models represent more features than dimensions, especially when features are sparse.
Spine: Sparse interpretable neural embeddings - Presents SPINE, a method transforming dense word embeddings into sparse, interpretable ones using denoising autoencoders.
Transformer visualization via dictionary learning: contextualized embedding as a linear superposition of transformer factors - Introduces a novel method for visualizing transformer networks using dictionary learning.
Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling - Introduces Pythia, a suite of LLMs trained on the same public data in the same order, designed for analyzing the training and scaling behaviors of LLMs.
On Interpretability and Feature Representations: An Analysis of the Sentiment Neuron - Critically examines the effectiveness of the "Sentiment Neuron".
Engineering monosemanticity in toy models - Explores engineering monosemanticity in neural networks, where individual neurons correspond to distinct features.
Polysemanticity and capacity in neural networks - Investigates polysemanticity in neural networks, where individual neurons represent multiple features.
An Overview of Early Vision in InceptionV1 - A comprehensive exploration of the initial five layers of the InceptionV1 neural network, focusing on early vision.
Visualizing and measuring the geometry of BERT - Delves into BERT's internal representation of linguistic information, focusing on both syntactic and semantic aspects.
Neurons in Large Language Models: Dead, N-gram, Positional - An analysis of neurons in large language models, focusing on the OPT family.
Can Large Language Models Explain Themselves? - Evaluates the effectiveness of self-explanations generated by LLMs in sentiment analysis tasks.
Interpretability in the Wild: GPT-2 small (arXiv) - Provides a mechanistic explanation of how GPT-2 small performs indirect object identification (IOI) in natural language processing.
Sparse Autoencoders Find Highly Interpretable Features in Language Models - Explores the use of sparse autoencoders to extract more interpretable and less polysemantic features from LLMs.
Emergent and Predictable Memorization in Large Language Models - Investigates whether a large model's memorization of specific training sequences can be predicted from smaller models or earlier checkpoints.
Transformers are uninterpretable with myopic methods: a case study with bounded Dyck grammars - Demonstrates that focusing only on specific parts like attention heads or weight matrices in Transformers can lead to misleading interpretability claims.
The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets - This paper investigates the representation of truth in Large Language Models (LLMs) using true/false datasets.
Interpretability at Scale: Identifying Causal Mechanisms in Alpaca - This study presents Boundless Distributed Alignment Search (Boundless DAS), an advanced method for interpreting LLMs like Alpaca.
Representation Engineering: A Top-Down Approach to AI Transparency - Introduces Representation Engineering (RepE), a novel approach for enhancing AI transparency, focusing on high-level representations rather than neurons or circuits.
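
Several entries above (Towards Monosemanticity, Sparse Autoencoders Find Highly Interpretable Features in Language Models) rely on sparse autoencoders. As a rough illustration only, the forward pass and training objective of a generic sparse autoencoder might look like the sketch below. The function names, the hand-picked weights, and the `l1_coeff` value are assumptions made for this example; they are not taken from any of the listed papers, whose architectures differ in details.

```python
# Minimal sparse autoencoder sketch in plain Python (no training loop).
# Assumption: a generic ReLU encoder, linear decoder, and MSE + L1 objective.

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def sae_forward(x, w_enc, b_enc, w_dec, b_dec):
    """Encode an activation vector into non-negative features, then reconstruct it."""
    pre = [h + b for h, b in zip(matvec(w_enc, x), b_enc)]
    features = [max(0.0, p) for p in pre]  # ReLU keeps features non-negative
    x_hat = [r + b for r, b in zip(matvec(w_dec, features), b_dec)]
    return features, x_hat

def sae_loss(x, x_hat, features, l1_coeff):
    """Reconstruction error plus an L1 penalty that pushes features toward sparsity."""
    mse = sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)
    l1 = sum(abs(f) for f in features)
    return mse + l1_coeff * l1

# Hypothetical hand-picked weights: 2-dimensional activations, 4 features
# (an overcomplete dictionary with one feature per signed axis direction).
w_enc = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]
b_enc = [0.0, 0.0, 0.0, 0.0]
w_dec = [[1.0, -1.0, 0.0, 0.0], [0.0, 0.0, 1.0, -1.0]]
b_dec = [0.0, 0.0]

x = [0.5, -2.0]
features, x_hat = sae_forward(x, w_enc, b_enc, w_dec, b_dec)
loss = sae_loss(x, x_hat, features, l1_coeff=0.01)
```

In this toy setup only two of the four features are active for the input, and the reconstruction is exact, so the loss consists entirely of the sparsity penalty. A real sparse autoencoder would learn its weights by gradient descent on activations collected from a model.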

LLM Interpretability Articles

Insightful articles and blog posts on LLM interpretability.

A New Approach to Computation Reimagines Artificial Intelligence - Discusses hyperdimensional computing, a novel method involving hyperdimensional vectors (hypervectors) for more efficient, transparent, and robust artificial intelligence.
Interpreting GPT: the logit lens - Explores how the logit lens reveals a gradual convergence of GPT's probabilistic predictions across its layers, from initial nonsensical or shallow guesses to more refined predictions.
A Mechanistic Interpretability Analysis of Grokking - Explores the phenomenon of 'grokking' in deep learning, where models suddenly shift from memorization to generalization during training.
200 Concrete Open Problems in Mechanistic Interpretability - Series of posts discussing open research problems in the field of Mechanistic Interpretability (MI), which focuses on reverse-engineering neural networks.
Evaluating LLMs is a minefield - Challenges in assessing the performance and biases of large language models (LLMs) like GPT.
Attribution Patching: Activation Patching At Industrial Scale - Method that uses gradients for a linear approximation of activation patching in neural networks.
Causal Scrubbing: a method for rigorously testing interpretability hypotheses [Redwood Research] - Introduces causal scrubbing, a method for evaluating the quality of mechanistic interpretations in neural networks.
A circuit for Python docstrings in a 4-layer attention-only transformer - Examines a specific neural circuit within a 4-layer transformer model responsible for generating Python docstrings.
Discovering Latent Knowledge in Language Models Without Supervision - Presents a method for extracting accurate answers to yes-no questions from language models' internal activations without supervision.
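
The gradient-based linear approximation described in the Attribution Patching entry above can be illustrated on a toy function. The sketch below is an assumption-laden example, not code from the article: the "model" is a hypothetical `tanh` of a weighted sum of three activations, the gradient is written out analytically instead of using autograd, and all names and values are made up for illustration. It compares exact activation patching (one forward pass per patched activation) with the first-order estimate (one gradient evaluation on the corrupted run).

```python
import math

def metric(acts, weights):
    """Toy stand-in for a model's output metric: a nonlinear function of activations."""
    z = sum(w * a for w, a in zip(weights, acts))
    return math.tanh(z)

def metric_grad(acts, weights):
    """Analytic gradient of the metric with respect to each activation."""
    z = sum(w * a for w, a in zip(weights, acts))
    return [w * (1.0 - math.tanh(z) ** 2) for w in weights]

def activation_patching(clean, corrupt, weights):
    """Exact effect of patching each clean activation into the corrupted run."""
    base = metric(corrupt, weights)
    effects = []
    for i in range(len(corrupt)):
        patched = list(corrupt)
        patched[i] = clean[i]  # swap in one clean activation
        effects.append(metric(patched, weights) - base)
    return effects

def attribution_patching(clean, corrupt, weights):
    """First-order estimate: (clean - corrupt) times the gradient at the corrupted run."""
    grad = metric_grad(corrupt, weights)
    return [(c - k) * g for c, k, g in zip(clean, corrupt, grad)]

# Hypothetical values chosen so clean and corrupted activations are close,
# which is the regime where a linear approximation is expected to work well.
weights = [0.5, -0.3, 0.8]
clean = [0.2, 0.1, 0.3]
corrupt = [0.1, 0.2, 0.25]

exact = activation_patching(clean, corrupt, weights)
approx = attribution_patching(clean, corrupt, weights)
```

For these inputs the two lists agree closely. The approximation degrades as the clean and corrupted activations move further apart or the metric becomes more strongly nonlinear.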

LLM Interpretability Groups

Communities and groups dedicated to LLM interpretability.

Alignment Lab AI - Group of researchers focusing on AI alignment.
Nous Research - Research group discussing various topics on interpretability.
EleutherAI - Non-profit AI research lab that focuses on interpretability and alignment of large models.

Contributing and Collaborating

Please see CONTRIBUTING and CODE-OF-CONDUCT for details.