Focus: understanding the internal mechanisms of large language models (LLMs). (Continuously updated ...)
For conference paper recommendations, please contact me.
https://transformer-circuits.pub/2023/interpretability-dreams/index.html
https://www.lesswrong.com/posts/X2i9dQQK3gETCyqh2/chris-olah-s-views-on-agi-safety
Interpreting Arithmetic Mechanism in Large Language Models through Comparative Neuron Analysis. [pdf] [EMNLP 2024] [2024.9]
Scaling and evaluating sparse autoencoders. [pdf] [OpenAI] [2024.6]
How do Large Language Models Learn In-Context? Query and Key Matrices of In-Context Heads are Two Towers for Metric Learning. [pdf] [EMNLP 2024] [2024.6]
Neuron-Level Knowledge Attribution in Large Language Models. [pdf] [EMNLP 2024] [2024.6]
Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet. [blog] [Anthropic] [2024.5]
Locating and Editing Factual Associations in Mamba. [pdf] [COLM 2024] [2024.4]
Chain-of-Thought Reasoning Without Prompting. [pdf] [DeepMind] [2024.2]
Fine-Tuning Enhances Existing Mechanisms: A Case Study on Entity Tracking. [pdf] [ICLR 2024] [2024.2]
Long-form evaluation of model editing. [pdf] [NAACL 2024] [2024.2]
What does the Knowledge Neuron Thesis Have to do with Knowledge? [pdf] [ICLR 2024] [2023.11]
Mechanistically analyzing the effects of fine-tuning on procedurally defined tasks. [pdf] [ICLR 2024] [2023.11]
Interpreting CLIP's Image Representation via Text-Based Decomposition. [pdf] [ICLR 2024] [2023.10]
Towards Best Practices of Activation Patching in Language Models: Metrics and Methods. [pdf] [ICLR 2024] [2023.10]
Fact Finding: Attempting to Reverse-Engineer Factual Recall on the Neuron Level. [blog] [DeepMind] [2023.12]
Successor Heads: Recurring, Interpretable Attention Heads In The Wild. [pdf] [ICLR 2024] [2023.12]
Towards Monosemanticity: Decomposing Language Models With Dictionary Learning. [blog] [Anthropic] [2023.10]
Impact of Co-occurrence on Factual Knowledge of Large Language Models. [pdf] [EMNLP 2023] [2023.10]
Function vectors in large language models. [pdf] [ICLR 2024] [2023.10]
Can Large Language Models Explain Themselves? [pdf] [2023.10]
Neurons in Large Language Models: Dead, N-gram, Positional. [pdf] [ACL 2024] [2023.9]
Sparse Autoencoders Find Highly Interpretable Features in Language Models. [pdf] [ICLR 2024] [2023.9]
Do Machine Learning Models Memorize or Generalize? [blog] [2023.8]
Overthinking the Truth: Understanding how Language Models Process False Demonstrations. [pdf] [2023.7]
Label Words are Anchors: An Information Flow Perspective for Understanding In-Context Learning. [pdf] [EMNLP 2023 best paper] [2023.5]
Let's Verify Step by Step. [pdf] [ICLR 2024] [2023.5]
What In-Context Learning "Learns" In-Context: Disentangling Task Recognition and Task Learning. [pdf] [ACL 2023] [2023.5]
Language models can explain neurons in language models. [blog] [OpenAI] [2023.5]
A Mechanistic Interpretation of Arithmetic Reasoning in Language Models using Causal Mediation Analysis. [pdf] [EMNLP 2023] [2023.5]
Dissecting Recall of Factual Associations in Auto-Regressive Language Models. [pdf] [EMNLP 2023] [2023.4]
Are Emergent Abilities of Large Language Models a Mirage? [pdf] [NeurIPS 2023 best paper] [2023.4]
The Closeness of In-Context Learning and Weight Shifting for Softmax Regression. [pdf] [2023.4]
Towards automated circuit discovery for mechanistic interpretability. [pdf] [NeurIPS 2023] [2023.4]
How does GPT-2 compute greater-than?: Interpreting mathematical abilities in a pre-trained language model. [pdf] [NeurIPS 2023] [2023.4]
A Theory of Emergent In-Context Learning as Implicit Structure Induction. [pdf] [2023.3]
Larger language models do in-context learning differently. [pdf] [Google Research] [2023.3]
Does Localization Inform Editing? Surprising Differences in Causality-Based Localization vs. Knowledge Editing in Language Models. [pdf] [NeurIPS 2023] [2023.1]
Transformers as Algorithms: Generalization and Stability in In-context Learning. [pdf] [ICML 2023] [2023.1]
Why Can GPT Learn In-Context? Language Models Implicitly Perform Gradient Descent as Meta-Optimizers. [pdf] [ACL 2023] [2022.12]
How does GPT Obtain its Ability? Tracing Emergent Abilities of Language Models to their Sources. [blog] [2022.12]
Towards Understanding Chain-of-Thought Prompting: An Empirical Study of What Matters. [pdf] [ACL 2023] [2022.12]
Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small. [pdf] [ICLR 2023] [2022.11]
Inverse scaling can become U-shaped. [pdf] [EMNLP 2023] [2022.11]
What learning algorithm is in-context learning? Investigations with linear models. [pdf] [ICLR 2023] [2022.11]
Mass-Editing Memory in a Transformer. [pdf] [ICLR 2023] [2022.10]
Polysemanticity and Capacity in Neural Networks. [pdf] [2022.10]
Analyzing Transformers in Embedding Space. [pdf] [ACL 2023] [2022.9]
Toy Models of Superposition. [blog] [Anthropic] [2022.9]
Text and Patterns: For Effective Chain of Thought, It Takes Two to Tango. [pdf] [2022.9]
Emergent Abilities of Large Language Models. [pdf] [Google Research] [2022.6]
Mechanistic Interpretability, Variables, and the Importance of Interpretable Bases. [blog] [Anthropic] [2022.6]
Towards Tracing Factual Knowledge in Language Models Back to the Training Data. [pdf] [EMNLP 2022] [2022.5]
Ground-Truth Labels Matter: A Deeper Look into Input-Label Demonstrations. [pdf] [EMNLP 2022] [2022.5]
Large Language Models are Zero-Shot Reasoners. [pdf] [NeurIPS 2022] [2022.5]
Scaling Laws and Interpretability of Learning from Repeated Data. [pdf] [Anthropic] [2022.5]
Transformer Feed-Forward Layers Build Predictions by Promoting Concepts in the Vocabulary Space. [pdf] [EMNLP 2022] [2022.3]
In-context Learning and Induction Heads. [blog] [Anthropic] [2022.3]
Locating and Editing Factual Associations in GPT. [pdf] [NeurIPS 2022] [2022.2]
Rethinking the Role of Demonstrations: What Makes In-Context Learning Work? [pdf] [EMNLP 2022] [2022.2]
Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets. [pdf] [OpenAI & Google] [2022.1]
Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. [pdf] [NeurIPS 2022] [2022.1]
A Mathematical Framework for Transformer Circuits. [blog] [Anthropic] [2021.12]
An Explanation of In-context Learning as Implicit Bayesian Inference. [pdf] [ICLR 2022] [2021.11]
Towards a Unified View of Parameter-Efficient Transfer Learning. [pdf] [ICLR 2022] [2021.10]
Do Prompt-Based Models Really Understand the Meaning of their Prompts? [pdf] [NAACL 2022] [2021.9]
Deduplicating Training Data Makes Language Models Better. [pdf] [ACL 2022] [2021.7]
LoRA: Low-Rank Adaptation of Large Language Models. [pdf] [ICLR 2022] [2021.6]
Fantastically Ordered Prompts and Where to Find Them: Overcoming Few-Shot Prompt Order Sensitivity. [pdf] [ACL 2022] [2021.4]
The Power of Scale for Parameter-Efficient Prompt Tuning. [pdf] [EMNLP 2021] [2021.4]
Calibrate Before Use: Improving Few-Shot Performance of Language Models. [pdf] [ICML 2021] [2021.2]
Prefix-Tuning: Optimizing Continuous Prompts for Generation. [pdf] [ACL 2021] [2021.1]
Transformer Feed-Forward Layers Are Key-Value Memories. [pdf] [EMNLP 2021] [2020.12]
Scaling Laws for Neural Language Models. [pdf] [OpenAI] [2020.1]
Mechanistic Interpretability for AI Safety - A Review. [pdf] [2024.8]
A Practical Review of Mechanistic Interpretability for Transformer-Based Language Models. [pdf] [2024.7]
Internal Consistency and Self-Feedback in Large Language Models: A Survey. [pdf] [2024.7]
A Primer on the Inner Workings of Transformer-based Language Models. [pdf] [2024.5] [interpretability]
Usable XAI: 10 strategies towards exploiting explainability in the LLM era. [pdf] [2024.3] [interpretability]
A Comprehensive Overview of Large Language Models. [pdf] [2023.12] [LLM]
A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions. [pdf] [2023.11] [hallucination]
A Survey of Large Language Models. [pdf] [2023.11] [LLM]
Explainability for Large Language Models: A Survey. [pdf] [2023.11] [interpretability]
A Survey of Chain of Thought Reasoning: Advances, Frontiers and Future. [pdf] [2023.10] [chain of thought]
Instruction tuning for large language models: A survey. [pdf] [2023.10] [instruction tuning]
From Language Modeling to Instruction Following: Understanding the Behavior Shift in LLMs after Instruction Tuning. [pdf] [2023.9] [instruction tuning]
Siren’s Song in the AI Ocean: A Survey on Hallucination in Large Language Models. [pdf] [2023.9] [hallucination]
Reasoning with language model prompting: A survey. [pdf] [2023.9] [reasoning]
Toward Transparent AI: A Survey on Interpreting the Inner Structures of Deep Neural Networks. [pdf] [2023.8] [interpretability]
A Survey on In-context Learning. [pdf] [2023.6] [in-context learning]
Scaling Down to Scale Up: A Guide to Parameter-Efficient Fine-Tuning. [pdf] [2023.3] [parameter-efficient fine-tuning]
- https://github.com/ruizheliUOA/Awesome-Interpretability-in-Large-Language-Models (interpretability)
- https://github.com/cooperleong00/Awesome-LLM-Interpretability?tab=readme-ov-file (interpretability)
- https://github.com/JShollaj/awesome-llm-interpretability (interpretability)
- https://github.com/IAAR-Shanghai/Awesome-Attention-Heads (attention)
- https://github.com/zjunlp/KnowledgeEditingPapers (model editing)