Awesome Papers for Understanding LLM Mechanism

Focus: understanding the internal mechanisms of large language models (LLMs). (Continuously updated.)

To recommend a conference paper for this list, please contact me.

Why mechanistic interpretability?

https://transformer-circuits.pub/2023/interpretability-dreams/index.html

https://www.lesswrong.com/posts/uK6sQCNMw8WKzJeCQ/a-longlist-of-theories-of-impact-for-interpretability

https://www.lesswrong.com/posts/X2i9dQQK3gETCyqh2/chris-olah-s-views-on-agi-safety

Papers

Interpreting Arithmetic Mechanism in Large Language Models through Comparative Neuron Analysis. [pdf] [EMNLP 2024] [2024.9]

Scaling and evaluating sparse autoencoders. [pdf] [OpenAI] [2024.6]

How do Large Language Models Learn In-Context? Query and Key Matrices of In-Context Heads are Two Towers for Metric Learning. [pdf] [EMNLP 2024] [2024.6]

Neuron-Level Knowledge Attribution in Large Language Models. [pdf] [EMNLP 2024] [2024.6]

Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet. [blog] [Anthropic] [2024.5]

Locating and Editing Factual Associations in Mamba. [pdf] [COLM 2024] [2024.4]

Chain-of-Thought Reasoning Without Prompting. [pdf] [DeepMind] [2024.2]

Fine-Tuning Enhances Existing Mechanisms: A Case Study on Entity Tracking. [pdf] [ICLR 2024] [2024.2]

Long-form evaluation of model editing. [pdf] [NAACL 2024] [2024.2]

What does the Knowledge Neuron Thesis Have to do with Knowledge? [pdf] [ICLR 2024] [2023.11]

Mechanistically analyzing the effects of fine-tuning on procedurally defined tasks. [pdf] [ICLR 2024] [2023.11]

Interpreting CLIP's Image Representation via Text-Based Decomposition. [pdf] [ICLR 2024] [2023.10]

Towards Best Practices of Activation Patching in Language Models: Metrics and Methods. [pdf] [ICLR 2024] [2023.10]

Fact Finding: Attempting to Reverse-Engineer Factual Recall on the Neuron Level. [blog] [DeepMind] [2023.12]

Successor Heads: Recurring, Interpretable Attention Heads In The Wild. [pdf] [ICLR 2024] [2023.12]

Towards Monosemanticity: Decomposing Language Models With Dictionary Learning. [blog] [Anthropic] [2023.10]

Impact of Co-occurrence on Factual Knowledge of Large Language Models. [pdf] [EMNLP 2023] [2023.10]

Function vectors in large language models. [pdf] [ICLR 2024] [2023.10]

Can Large Language Models Explain Themselves? [pdf] [2023.10]

Neurons in Large Language Models: Dead, N-gram, Positional. [pdf] [ACL 2024] [2023.9]

Sparse Autoencoders Find Highly Interpretable Features in Language Models. [pdf] [ICLR 2024] [2023.9]

Do Machine Learning Models Memorize or Generalize? [blog] [2023.8]

Overthinking the Truth: Understanding how Language Models Process False Demonstrations. [pdf] [2023.7]

Label Words are Anchors: An Information Flow Perspective for Understanding In-Context Learning. [pdf] [EMNLP 2023 best paper] [2023.5]

Let's Verify Step by Step. [pdf] [ICLR 2024] [2023.5]

What In-Context Learning "Learns" In-Context: Disentangling Task Recognition and Task Learning. [pdf] [ACL 2023] [2023.5]

Language models can explain neurons in language models. [blog] [OpenAI] [2023.5]

A Mechanistic Interpretation of Arithmetic Reasoning in Language Models using Causal Mediation Analysis. [pdf] [EMNLP 2023] [2023.5]

Dissecting Recall of Factual Associations in Auto-Regressive Language Models. [pdf] [EMNLP 2023] [2023.4]

Are Emergent Abilities of Large Language Models a Mirage? [pdf] [NeurIPS 2023 best paper] [2023.4]

The Closeness of In-Context Learning and Weight Shifting for Softmax Regression. [pdf] [2023.4]

Towards automated circuit discovery for mechanistic interpretability. [pdf] [NeurIPS 2023] [2023.4]

How does GPT-2 compute greater-than?: Interpreting mathematical abilities in a pre-trained language model. [pdf] [NeurIPS 2023] [2023.4]

A Theory of Emergent In-Context Learning as Implicit Structure Induction. [pdf] [2023.3]

Larger language models do in-context learning differently. [pdf] [Google Research] [2023.3]

Does Localization Inform Editing? Surprising Differences in Causality-Based Localization vs. Knowledge Editing in Language Models. [pdf] [NeurIPS 2023] [2023.1]

Transformers as Algorithms: Generalization and Stability in In-context Learning. [pdf] [ICML 2023] [2023.1]

Why Can GPT Learn In-Context? Language Models Implicitly Perform Gradient Descent as Meta-Optimizers. [pdf] [ACL 2023] [2022.12]

How does GPT Obtain its Ability? Tracing Emergent Abilities of Language Models to their Sources. [blog] [2022.12]

Towards Understanding Chain-of-Thought Prompting: An Empirical Study of What Matters. [pdf] [ACL 2023] [2022.12]

Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small. [pdf] [ICLR 2023] [2022.11]

Inverse scaling can become U-shaped. [pdf] [EMNLP 2023] [2022.11]

What learning algorithm is in-context learning? Investigations with linear models. [pdf] [ICLR 2023] [2022.11]

Mass-Editing Memory in a Transformer. [pdf] [ICLR 2023] [2022.10]

Polysemanticity and Capacity in Neural Networks. [pdf] [2022.10]

Analyzing Transformers in Embedding Space. [pdf] [ACL 2023] [2022.9]

Toy Models of Superposition. [blog] [Anthropic] [2022.9]

Text and Patterns: For Effective Chain of Thought, It Takes Two to Tango. [pdf] [2022.9]

Emergent Abilities of Large Language Models. [pdf] [Google Research] [2022.6]

Mechanistic Interpretability, Variables, and the Importance of Interpretable Bases. [blog] [Anthropic] [2022.6]

Towards Tracing Factual Knowledge in Language Models Back to the Training Data. [pdf] [EMNLP 2022] [2022.5]

Ground-Truth Labels Matter: A Deeper Look into Input-Label Demonstrations. [pdf] [EMNLP 2022] [2022.5]

Large Language Models are Zero-Shot Reasoners. [pdf] [NeurIPS 2022] [2022.5]

Scaling Laws and Interpretability of Learning from Repeated Data. [pdf] [Anthropic] [2022.5]

Transformer Feed-Forward Layers Build Predictions by Promoting Concepts in the Vocabulary Space. [pdf] [EMNLP 2022] [2022.3]

In-context Learning and Induction Heads. [blog] [Anthropic] [2022.3]

Locating and Editing Factual Associations in GPT. [pdf] [NeurIPS 2022] [2022.2]

Rethinking the Role of Demonstrations: What Makes In-Context Learning Work? [pdf] [EMNLP 2022] [2022.2]

Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets. [pdf] [OpenAI & Google] [2022.1]

Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. [pdf] [NeurIPS 2022] [2022.1]

A Mathematical Framework for Transformer Circuits. [blog] [Anthropic] [2021.12]

An Explanation of In-context Learning as Implicit Bayesian Inference. [pdf] [ICLR 2022] [2021.11]

Towards a Unified View of Parameter-Efficient Transfer Learning. [pdf] [ICLR 2022] [2021.10]

Do Prompt-Based Models Really Understand the Meaning of their Prompts? [pdf] [NAACL 2022] [2021.9]

Deduplicating Training Data Makes Language Models Better. [pdf] [ACL 2022] [2021.7]

LoRA: Low-Rank Adaptation of Large Language Models. [pdf] [ICLR 2022] [2021.6]

Fantastically Ordered Prompts and Where to Find Them: Overcoming Few-Shot Prompt Order Sensitivity. [pdf] [ACL 2022] [2021.4]

The Power of Scale for Parameter-Efficient Prompt Tuning. [pdf] [EMNLP 2021] [2021.4]

Calibrate Before Use: Improving Few-Shot Performance of Language Models. [pdf] [ICML 2021] [2021.2]

Prefix-Tuning: Optimizing Continuous Prompts for Generation. [pdf] [ACL 2021] [2021.1]

Transformer Feed-Forward Layers Are Key-Value Memories. [pdf] [EMNLP 2021] [2020.12]

Scaling Laws for Neural Language Models. [pdf] [OpenAI] [2020.1]

Survey

Mechanistic Interpretability for AI Safety -- A Review. [pdf] [2024.8]

A Practical Review of Mechanistic Interpretability for Transformer-Based Language Models. [pdf] [2024.7]

Internal Consistency and Self-Feedback in Large Language Models: A Survey. [pdf] [2024.7]

A Primer on the Inner Workings of Transformer-based Language Models. [pdf] [2024.5] [interpretability]

Usable XAI: 10 strategies towards exploiting explainability in the LLM era. [pdf] [2024.3] [interpretability]

A Comprehensive Overview of Large Language Models. [pdf] [2023.12] [LLM]

A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions. [pdf] [2023.11] [hallucination]

A Survey of Large Language Models. [pdf] [2023.11] [LLM]

Explainability for Large Language Models: A Survey. [pdf] [2023.11] [interpretability]

A Survey of Chain of Thought Reasoning: Advances, Frontiers and Future. [pdf] [2023.10] [chain of thought]

Instruction tuning for large language models: A survey. [pdf] [2023.10] [instruction tuning]

From Language Modeling to Instruction Following: Understanding the Behavior Shift in LLMs after Instruction Tuning. [pdf] [2023.9] [instruction tuning]

Siren’s Song in the AI Ocean: A Survey on Hallucination in Large Language Models. [pdf] [2023.9] [hallucination]

Reasoning with language model prompting: A survey. [pdf] [2023.9] [reasoning]

Toward Transparent AI: A Survey on Interpreting the Inner Structures of Deep Neural Networks. [pdf] [2023.8] [interpretability]

A Survey on In-context Learning. [pdf] [2023.6] [in-context learning]

Scaling Down to Scale Up: A Guide to Parameter-Efficient Fine-Tuning. [pdf] [2023.3] [parameter-efficient fine-tuning]

Other good LLM repos

  1. https://github.com/ruizheliUOA/Awesome-Interpretability-in-Large-Language-Models (interpretability)

  2. https://github.com/cooperleong00/Awesome-LLM-Interpretability (interpretability)

  3. https://github.com/JShollaj/awesome-llm-interpretability (interpretability)

  4. https://github.com/IAAR-Shanghai/Awesome-Attention-Heads (attention)

  5. https://github.com/zjunlp/KnowledgeEditingPapers (model editing)

  6. https://github.com/Hannibal046/Awesome-LLM (LLM)