awesome-llm-papers-interpretability (after 2020)

Focusing on: interpretability of large language models (LLMs). (Kept up to date as I read good papers ...)

surveys

A Comprehensive Overview of Large Language Models. [pdf] [2023.12]

A Survey of Large Language Models. [pdf] [2023.11]

Explainability for Large Language Models: A Survey. [pdf] [2023.11]

A Survey of Chain of Thought Reasoning: Advances, Frontiers and Future. [pdf] [2023.10]

Toward Transparent AI: A Survey on Interpreting the Inner Structures of Deep Neural Networks. [pdf] [2023.8]

A Survey on In-context Learning. [pdf] [2023.6]

Scaling Down to Scale Up: A Guide to Parameter-Efficient Fine-Tuning. [pdf] [2023.3]

papers

How do Large Language Models Learn In-Context? Query and Key Matrices of In-Context Heads are Two Towers for Metric Learning. [pdf] [2024.2]

Locating Factual Knowledge in Large Language Models: Exploring the Residual Stream and Analyzing Subvalues in Vocabulary Space. [pdf] [2023.12]

Do Machine Learning Models Memorize or Generalize? [blog] [2023.8]

Label Words are Anchors: An Information Flow Perspective for Understanding In-Context Learning. [pdf] [EMNLP 2023] [2023.5]

What In-Context Learning "Learns" In-Context: Disentangling Task Recognition and Task Learning. [pdf] [ACL 2023] [2023.5]

Language models can explain neurons in language models. [blog] [2023.5]

Dissecting Recall of Factual Associations in Auto-Regressive Language Models. [pdf] [EMNLP 2023] [2023.4]

Are Emergent Abilities of Large Language Models a Mirage? [pdf] [NeurIPS 2023] [2023.4]

The Closeness of In-Context Learning and Weight Shifting for Softmax Regression. [pdf] [ICLR 2024] [2023.4]

How does GPT-2 compute greater-than?: Interpreting mathematical abilities in a pre-trained language model. [pdf] [NeurIPS 2023] [2023.4]

A Theory of Emergent In-Context Learning as Implicit Structure Induction. [pdf] [2023.3]

Larger language models do in-context learning differently. [pdf] [ICLR 2024] [2023.3]

Does Localization Inform Editing? Surprising Differences in Causality-Based Localization vs. Knowledge Editing in Language Models. [pdf] [NeurIPS 2023] [2023.1]

Transformers as Algorithms: Generalization and Stability in In-context Learning. [pdf] [ICML 2023] [2023.1]

Why Can GPT Learn In-Context? Language Models Implicitly Perform Gradient Descent as Meta-Optimizers. [pdf] [ACL 2023] [2022.12]

How does GPT Obtain its Ability? Tracing Emergent Abilities of Language Models to their Sources. [blog] [2022.12]

Towards Understanding Chain-of-Thought Prompting: An Empirical Study of What Matters. [pdf] [ACL 2023] [2022.12]

Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small. [pdf] [ICLR 2023] [2022.11]

Inverse scaling can become U-shaped. [pdf] [EMNLP 2023] [2022.11]

What learning algorithm is in-context learning? Investigations with linear models. [pdf] [ICLR 2023] [2022.11]

Mass-Editing Memory in a Transformer. [pdf] [ICLR 2023] [2022.10]

Polysemanticity and Capacity in Neural Networks. [pdf] [2022.10]

Analyzing Transformers in Embedding Space. [pdf] [ACL 2023] [2022.9]

Toy Models of Superposition. [blog] [2022.9]

Emergent Abilities of Large Language Models. [pdf] [2022.6]

Mechanistic Interpretability, Variables, and the Importance of Interpretable Bases. [blog] [2022.6]

Towards Tracing Factual Knowledge in Language Models Back to the Training Data. [pdf] [EMNLP 2022] [2022.5]

Ground-Truth Labels Matter: A Deeper Look into Input-Label Demonstrations. [pdf] [EMNLP 2022] [2022.5]

Large Language Models are Zero-Shot Reasoners. [pdf] [NeurIPS 2022] [2022.5]

Scaling Laws and Interpretability of Learning from Repeated Data. [pdf] [2022.5]

Transformer Feed-Forward Layers Build Predictions by Promoting Concepts in the Vocabulary Space. [pdf] [EMNLP 2022] [2022.3]

In-context Learning and Induction Heads. [blog] [2022.3]
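
The match-and-copy behavior this post attributes to induction heads can be stated outside a transformer: complete the pattern `[A][B] ... [A] -> [B]` by copying whatever followed the previous occurrence of the current token. A minimal, non-neural sketch (the function is my own illustration, not from the post):

```python
def induction_head_prediction(tokens):
    """Toy sketch of the induction-head pattern [A][B] ... [A] -> [B].

    Scan backwards for the most recent earlier occurrence of the final
    token and predict the token that followed it. A real induction head
    implements this match-and-copy via attention, not an explicit loop.
    """
    current = tokens[-1]
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == current:
            return tokens[i + 1]
    return None  # pattern never seen before in this context

print(induction_head_prediction(["the", "cat", "sat", "the"]))  # -> "cat"
```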

Locating and Editing Factual Associations in GPT. [pdf] [NeurIPS 2022] [2022.2]

Rethinking the Role of Demonstrations: What Makes In-Context Learning Work? [pdf] [EMNLP 2022] [2022.2]

Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets. [pdf] [2022.1]

Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. [pdf] [2022.1]
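
The technique itself is just a prompt format: few-shot exemplars that include intermediate reasoning before the answer. A minimal sketch, with the exemplar lightly paraphrased from the paper's running example:

```python
# Few-shot chain-of-thought prompt in the style of Wei et al. (2022).
COT_PROMPT = """\
Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. \
Each can has 3 tennis balls. How many tennis balls does he have now?
A: Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 tennis balls. \
5 + 6 = 11. The answer is 11.

Q: {question}
A:"""

print(COT_PROMPT.format(question="A baker makes 4 trays of 6 muffins. How many muffins?"))
```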

A Mathematical Framework for Transformer Circuits. [blog] [2021.12]

An Explanation of In-context Learning as Implicit Bayesian Inference. [pdf] [ICLR 2022] [2021.11]

Towards a Unified View of Parameter-Efficient Transfer Learning. [pdf] [ICLR 2022] [2021.10]

Do Prompt-Based Models Really Understand the Meaning of their Prompts? [pdf] [NAACL 2022] [2021.9]

Deduplicating Training Data Makes Language Models Better. [pdf] [ACL 2022] [2021.7]

LoRA: Low-Rank Adaptation of Large Language Models. [pdf] [ICLR 2022] [2021.6]
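
The core idea fits in a few lines: freeze the pretrained weight `W` and learn a low-rank update `BA`, so the adapted layer computes `h = Wx + BAx`. A minimal PyTorch sketch under that reading (class name and hyperparameter defaults are illustrative, not the paper's):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Sketch of a LoRA-adapted linear layer: y = Wx + (alpha/r) * B(Ax)."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)  # pretrained weights stay frozen
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # BA starts at zero
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(768, 768), r=8)
y = layer(torch.randn(4, 768))  # only A and B receive gradients
```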

Fantastically Ordered Prompts and Where to Find Them: Overcoming Few-Shot Prompt Order Sensitivity. [pdf] [ACL 2022] [2021.4]

The Power of Scale for Parameter-Efficient Prompt Tuning. [pdf] [EMNLP 2021] [2021.4]

Calibrate Before Use: Improving Few-Shot Performance of Language Models. [pdf] [ICML 2021] [2021.2]

Prefix-Tuning: Optimizing Continuous Prompts for Generation. [pdf] [ACL 2021] [2021.1]

Transformer Feed-Forward Layers Are Key-Value Memories. [pdf] [EMNLP 2021] [2020.12]
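
The paper's reading of the feed-forward block: rows of the first matrix act as "keys" that match input patterns, rows of the second as "values" that promote output concepts, i.e. `FFN(x) = f(x K^T) V`. A small NumPy sketch under that interpretation (shapes and names are illustrative):

```python
import numpy as np

def ffn_as_kv_memory(x, K, V):
    """Feed-forward layer viewed as key-value memory.

    x: (d_model,)       residual-stream input
    K: (d_ff, d_model)  rows are "keys" detecting input patterns
    V: (d_ff, d_model)  rows are "values" promoting output concepts
    """
    # How strongly each memory's key fires; ReLU stands in for
    # whichever activation the actual model uses.
    memory_coeffs = np.maximum(x @ K.T, 0.0)
    return memory_coeffs @ V  # weighted sum of the corresponding values

rng = np.random.default_rng(0)
d_model, d_ff = 16, 64
out = ffn_as_kv_memory(rng.normal(size=d_model),
                       rng.normal(size=(d_ff, d_model)),
                       rng.normal(size=(d_ff, d_model)))
print(out.shape)  # (16,)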

Scaling Laws for Neural Language Models. [pdf] [2020.1]
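
The headline result fits in one formula: test loss falls as a power law in non-embedding parameter count, `L(N) = (N_c / N)^alpha_N`. A one-liner to evaluate it, using the approximate constants reported in the paper (quoted from memory; the formula ignores data and compute bottlenecks):

```python
def kaplan_loss(n_params, n_c=8.8e13, alpha_n=0.076):
    """Power-law loss vs. parameter count, L(N) = (N_c / N)**alpha_N."""
    return (n_c / n_params) ** alpha_n

for n in (1e8, 1e9, 1e10):
    print(f"N={n:.0e}: L={kaplan_loss(n):.2f}")  # loss in nats, falling with scale
```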