2024-08-01 |
Enhancing Semantic Consistency of Large Language Models through Model Editing: An Interpretability-Oriented Approach • Introduces a cost-effective model editing approach focusing on attention heads to enhance semantic consistency in LLMs without extensive parameter changes. • Analyzed attention heads, injected biases, and tested on NLU and NLG datasets. • Achieved notable improvements in semantic consistency and task performance, with strong generalization across additional tasks. |
|
|
2024-07-31 |
Correcting Negative Bias in Large Language Models through Negative Attention Score Alignment • Introduced Negative Attention Score (NAS) to quantify and correct negative bias in language models. • Identified negatively biased attention heads and proposed Negative Attention Score Alignment (NASA) for fine-tuning. • NASA effectively reduced the precision-recall gap while preserving generalization in binary decision tasks. |
|
|
2024-07-29 |
Detecting and Understanding Vulnerabilities in Language Models via Mechanistic Interpretability • Introduces a method that uses Mechanistic Interpretability (MI) to detect and understand vulnerabilities in LLMs, particularly those exposed by adversarial attacks. • Analyzes GPT-2 Small on the task of predicting 3-letter acronyms. • Successfully identifies and explains specific vulnerabilities of the model on this task.
|
|
2024-07-22 |
RazorAttention: Efficient KV Cache Compression Through Retrieval Heads • Introduced RazorAttention, a training-free KV cache compression technique using retrieval heads and compensation tokens to preserve critical token information. • Evaluated RazorAttention on large language models (LLMs) for efficiency. • Achieved over 70% KV cache size reduction with no noticeable performance impact. |
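The core idea can be sketched in a few lines. This is a minimal illustration of the concept, not the paper's implementation: retrieval heads keep their full KV cache, while other heads keep only a recent window plus a single mean "compensation token" standing in for the dropped entries (the window size and mean-pooling choice here are illustrative assumptions).

```python
import numpy as np

def compress_kv(keys, values, is_retrieval_head, window=4):
    """Toy RazorAttention-style cache compression for one head.

    Retrieval heads keep the full cache; other heads keep only the
    most recent `window` entries plus one mean "compensation token"
    summarizing everything dropped.
    """
    if is_retrieval_head or len(keys) <= window + 1:
        return keys, values
    drop_k, drop_v = keys[:-window], values[:-window]
    comp_k = drop_k.mean(axis=0, keepdims=True)
    comp_v = drop_v.mean(axis=0, keepdims=True)
    return (np.concatenate([comp_k, keys[-window:]]),
            np.concatenate([comp_v, values[-window:]]))

rng = np.random.default_rng(0)
K, V = rng.normal(size=(16, 8)), rng.normal(size=(16, 8))
ck, cv = compress_kv(K, V, is_retrieval_head=False, window=4)
print(ck.shape)  # (5, 8): 4 recent tokens + 1 compensation token
```

Because only a small fraction of heads are retrieval heads, most of the cache shrinks to a near-constant size while the information-critical heads stay lossless.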
|
|
2024-07-21 |
Answer, Assemble, Ace: Understanding How Transformers Answer Multiple Choice Questions • The paper uses vocabulary projection and activation patching to localize hidden states that predict the correct MCQA answer. • Identified key attention heads and layers responsible for answer selection in transformers. • Middle-layer attention heads are crucial for accurate answer prediction, with a sparse set of heads playing unique roles.
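Both techniques are easy to state on a toy model. In the sketch below (illustrative assumptions throughout: the residual stream is modeled as a plain sum of per-layer writes, with random weights), vocabulary projection maps a residual-stream state onto token logits via the unembedding matrix, and activation patching swaps one layer's contribution from a corrupted run into a clean run to measure its causal effect on the logits.

```python
import numpy as np

rng = np.random.default_rng(1)
n_layers, d_model, vocab = 4, 8, 10
W_U = rng.normal(size=(d_model, vocab))        # unembedding matrix

# Toy residual stream: the final state is the sum of per-layer writes.
clean   = rng.normal(size=(n_layers, d_model))
corrupt = rng.normal(size=(n_layers, d_model))

def logits(writes):
    """Vocabulary projection: residual state -> token logits."""
    return writes.sum(axis=0) @ W_U

def patch(writes, donor, layer):
    """Activation patching: replace one layer's write with the donor's."""
    out = writes.copy()
    out[layer] = donor[layer]
    return out

# How much does layer 2's activation matter for the prediction?
effect = logits(patch(clean, corrupt, layer=2)) - logits(clean)
```

Large entries in `effect` indicate tokens whose logits depend causally on that layer's state, which is how the localization in the paper's bullets is operationalized.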
|
|
2024-07-09 |
Induction Heads as an Essential Mechanism for Pattern Matching in In-context Learning • The article identifies induction heads as crucial for pattern matching in in-context learning (ICL). • Evaluated Llama-3-8B and InternLM2-20B on abstract pattern recognition and NLP tasks. • Ablating induction heads reduces ICL performance by up to ~32%, bringing it close to random for pattern recognition. |
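The pattern-matching rule an induction head implements can be written as a one-function sketch (this is the standard "match-and-copy" description of induction heads, not code from the paper): find the most recent earlier occurrence of the current token and predict the token that followed it.

```python
def induction_predict(tokens):
    """Toy induction-head rule: locate the most recent earlier
    occurrence of the current token and copy its successor."""
    cur = tokens[-1]
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == cur:
            return tokens[i + 1]
    return None  # no earlier match: the head has nothing to copy

print(induction_predict(list("abcab")))  # 'c' — completes the repeated pattern
```

Ablating the heads that implement this rule is what drops the abstract pattern-recognition performance toward random in the paper's experiments.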
|
|
2024-07-01 |
Steering Large Language Models for Cross-lingual Information Retrieval • Introduces Activation Steered Multilingual Retrieval (ASMR), using steering activations to guide LLMs for improved cross-lingual information retrieval. • Identified attention heads in LLMs affecting accuracy and language coherence, and applied steering activations. • ASMR achieved state-of-the-art performance on CLIR benchmarks like XOR-TyDi QA and MKQA. |
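Activation steering itself is a simple operation. A common recipe (sketched below with illustrative random data; the difference-of-means construction and the scale `alpha` are assumptions, not the paper's exact procedure) builds a steering vector from the mean activation difference between two behaviors — e.g., responses in the target language versus the source language — and adds it to a layer's activations at inference time.

```python
import numpy as np

def steer(activations, steering_vec, alpha=2.0):
    """Add a scaled steering vector to a layer's activations
    (one vector, broadcast over all sequence positions)."""
    return activations + alpha * steering_vec

rng = np.random.default_rng(2)
# Mean-difference steering vector between two behaviour distributions.
target_acts = rng.normal(loc=0.5, size=(32, 8))
source_acts = rng.normal(loc=-0.5, size=(32, 8))
v = target_acts.mean(axis=0) - source_acts.mean(axis=0)

h = rng.normal(size=(5, 8))        # activations for 5 positions
h_steered = steer(h, v)
```

No weights change: the intervention is applied to the forward pass only, which is what makes this family of methods cheap to deploy.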
|
|
2024-06-21 |
MoA: Mixture of Sparse Attention for Automatic Large Language Model Compression • The paper introduces Mixture of Sparse Attention (MoA), which tailors distinct sparse attention configurations to different heads and layers, optimizing memory, throughput, and accuracy-latency trade-offs. • MoA profiles models, explores attention configurations, and improves LLM compression. • MoA increases effective context length by 3.9×, while reducing GPU memory usage by 1.2-1.4×.
|
|
2024-06-19 |
On the Difficulty of Faithful Chain-of-Thought Reasoning in Large Language Models • Introduced novel strategies for in-context learning, fine-tuning, and activation editing to improve Chain-of-Thought (CoT) reasoning faithfulness in LLMs. • Tested these strategies across multiple benchmarks to evaluate their effectiveness. • Found only limited success in enhancing CoT faithfulness, highlighting the challenge in achieving truly faithful reasoning in LLMs. |
|
|
2024-05-28 |
Knowledge Circuits in Pretrained Transformers • Introduced "knowledge circuits" in transformers, revealing how specific knowledge is encoded through interaction among attention heads, relation heads, and MLPs. • Analyzed GPT-2 and TinyLlama to identify knowledge circuits; evaluated knowledge editing techniques. • Demonstrated how knowledge circuits contribute to model behaviors like hallucinations and in-context learning.
|
|
2024-05-23 |
Linking In-context Learning in Transformers to Human Episodic Memory • Links in-context learning in Transformer models to human episodic memory, highlighting similarities between induction heads and the contextual maintenance and retrieval (CMR) model. • Analysis of Transformer-based LLMs to demonstrate CMR-like behavior in attention heads. • CMR-like heads emerge in intermediate layers, mirroring human memory biases. |
|
|
2024-05-07 |
How does GPT-2 Predict Acronyms? Extracting and Understanding a Circuit via Mechanistic Interpretability • Presents the first mechanistic interpretability study of how GPT-2 predicts multi-token acronyms via its attention heads. • Identified and interpreted a circuit of 8 attention heads responsible for acronym prediction. • Demonstrated that these 8 heads (~5% of the model's attention heads) concentrate the acronym prediction functionality.
|
|
2024-05-02 |
What needs to go right for an induction head? A mechanistic study of in-context learning circuits and their formation • Introduced an optogenetics-inspired causal framework to study induction head (IH) formation in transformers. • Analyzed IH emergence in transformers using synthetic data and identified three underlying subcircuits responsible for IH formation. • Discovered that these subcircuits interact to drive IH formation, coinciding with a phase change in model loss. |
|
|
2024-04-24 |
Retrieval Head Mechanistically Explains Long-Context Factuality • Identified "retrieval heads" in transformer models responsible for retrieving information across long contexts. • Systematic investigation of retrieval heads across various models, including analysis of their role in chain-of-thought reasoning. • Pruning retrieval heads leads to hallucination, while pruning non-retrieval heads doesn't affect retrieval ability. |
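A retrieval head can be operationalized with a simple score. The sketch below (toy data; the exact scoring in the paper may differ) measures the fraction of decoding steps at which a head's strongest attention lands on the needle token it is currently copying out of a long context.

```python
import numpy as np

def retrieval_score(attn, needle_start, copy_steps):
    """Fraction of decode steps at which the head's top attention
    is on the needle token being copied. attn[t] is the attention
    distribution over the context at decode step t."""
    hits = sum(int(np.argmax(attn[t]) == needle_start + k)
               for k, t in enumerate(copy_steps))
    return hits / len(copy_steps)

# Toy example: a 20-token context with a 3-token needle at position 5,
# copied during decode steps 0-2 by a head that tracks the needle.
attn = np.full((3, 20), 0.01)
attn[0, 5] = attn[1, 6] = attn[2, 7] = 0.9
print(retrieval_score(attn, needle_start=5, copy_steps=[0, 1, 2]))  # 1.0
```

Heads with a high score are the ones whose pruning induces hallucination in the paper's ablations; low-scoring heads can be pruned without hurting retrieval.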
|
|
2024-03-27 |
Non-Linear Inference Time Intervention: Improving LLM Truthfulness • Introduced Non-Linear Inference Time Intervention (NL-ITI), enhancing LLM truthfulness by multi-token probing and intervention without fine-tuning. • Evaluated NL-ITI on multiple-choice datasets, including TruthfulQA. • Achieved a 16% relative improvement in MC1 accuracy on TruthfulQA over baseline ITI. |
|
|
2024-02-28 |
Cutting Off the Head Ends the Conflict: A Mechanism for Interpreting and Mitigating Knowledge Conflicts in Language Models • Introduces the PH3 method to prune conflicting attention heads, mitigating knowledge conflicts in language models without parameter updates. • Applied PH3 to control LMs' reliance on internal memory vs. external context and tested its effectiveness on open-domain QA tasks. • PH3 improved internal memory usage by 44.0% and external context usage by 38.5%. |
|
|
2024-02-27 |
Information Flow Routes: Automatically Interpreting Language Models at Scale • Introduces "Information Flow Routes" using attribution for graph-based interpretation of language models, avoiding activation patching. • Experiments with Llama 2, identifying key attention heads and behavior patterns across different domains and tasks. • Uncovered specialized model components; identified consistent roles for attention heads, such as handling tokens of the same part of speech. |
|
|
2024-02-20 |
Identifying Semantic Induction Heads to Understand In-Context Learning • Identifies and studies "semantic induction heads" in large language models (LLMs) that correlate with in-context learning abilities. • Analyzed attention heads for encoding syntactic dependencies and knowledge graph relations. • Certain attention heads enhance output logits by recalling relevant tokens, crucial for understanding in-context learning in LLMs. |
|
|
2024-02-16 |
The Evolution of Statistical Induction Heads: In-Context Learning Markov Chains • Introduces a Markov Chain sequence modeling task to analyze how in-context learning (ICL) capabilities emerge in transformers, forming "statistical induction heads." • Empirical and theoretical investigation of multi-phase training in transformers on Markov Chain tasks. • Demonstrates phase transitions from unigram to bigram predictions, influenced by transformer layer interactions. |
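The unigram-to-bigram phase transition corresponds to two in-context estimators, sketched here in plain Python (an illustration of the two behaviors, not the paper's transformer): the early-phase model predicts the most frequent token in context, while the formed statistical induction head predicts the token that most often followed the current token.

```python
from collections import Counter, defaultdict

def unigram_predict(seq):
    """Early-phase behaviour: predict the most frequent context token."""
    return Counter(seq).most_common(1)[0][0]

def bigram_predict(seq):
    """Statistical-induction-head behaviour: predict the token that
    most often followed the current token within the context."""
    nxt = defaultdict(Counter)
    for a, b in zip(seq, seq[1:]):
        nxt[a][b] += 1
    cur = seq[-1]
    if nxt[cur]:
        return nxt[cur].most_common(1)[0][0]
    return unigram_predict(seq)       # back off when no bigram evidence

seq = [0, 1, 0, 1, 0, 1, 0]           # Markov chain that alternates
print(unigram_predict(seq), bigram_predict(seq))  # 0 1
```

On a sequence drawn from a Markov chain, only the bigram estimator matches the chain's transition structure, which is why the loss drops sharply once the induction circuit forms.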
|
|
2024-02-11 |
Summing Up the Facts: Additive Mechanisms Behind Factual Recall in LLMs • Identifies and explains the "additive motif" in factual recall, where LLMs use multiple independent mechanisms that constructively interfere to recall facts. • Extended direct logit attribution to analyze attention heads and unpacked the behavior of mixed heads. • Demonstrated that factual recall in LLMs results from the sum of multiple, independently insufficient contributions. |
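Direct logit attribution makes the additive motif concrete: because the residual stream is a sum of component writes and the unembedding is linear, the final logits decompose exactly into per-head contributions. A minimal sketch with random toy weights:

```python
import numpy as np

rng = np.random.default_rng(3)
n_heads, d_model, vocab = 6, 8, 12
W_U = rng.normal(size=(d_model, vocab))            # unembedding matrix
head_outputs = rng.normal(size=(n_heads, d_model)) # per-head residual writes

# Direct logit attribution: each head's contribution to the final
# logits is its residual-stream write projected through W_U.
per_head_logits = head_outputs @ W_U
total_logits = head_outputs.sum(axis=0) @ W_U

# Additive motif: the final logits are exactly the sum of per-head
# contributions; no single head need be sufficient on its own.
assert np.allclose(per_head_logits.sum(axis=0), total_logits)
```

Constructive interference then just means several heads independently pushing logit mass onto the same fact token, each insufficient alone but decisive in sum.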
|
|
2024-02-05 |
How do Large Language Models Learn In-Context? Query and Key Matrices of In-Context Heads are Two Towers for Metric Learning • Introduces the concept that query and key matrices in in-context heads operate as "two towers" for metric learning, facilitating similarity computation between label features. • Analyzed in-context learning mechanisms; identified specific attention heads crucial for ICL. • Reduced ICL accuracy from 87.6% to 24.4% by intervening in only 1% of these heads. |
|
|
2024-01-16 |
Circuit Component Reuse Across Tasks in Transformer Language Models • The paper demonstrates that specific circuits in GPT-2 can generalize across different tasks, challenging the notion that such circuits are task-specific. • It examines the reuse of circuits from the Indirect Object Identification (IOI) task in the Colored Objects task. • Adjusting four attention heads boosts accuracy from 49.6% to 93.7% in the Colored Objects task. |
|
|
2024-01-16 |
Successor Heads: Recurring, Interpretable Attention Heads In The Wild • The paper introduces "Successor Heads," attention heads in LLMs that increment tokens with natural orderings, like days or numbers. • It analyzes the formation of successor heads across various model sizes and architectures, such as GPT-2 and Llama-2. • Successor heads are found in models ranging from 31M to 12B parameters, revealing abstract, recurring numeric representations. |
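The input-output behavior of a successor head is easy to state as a lookup rule (a behavioral sketch only; the orderings below are truncated examples, and the real heads operate on learned numeric representations rather than tables):

```python
ORDERINGS = [
    ["Monday", "Tuesday", "Wednesday", "Thursday",
     "Friday", "Saturday", "Sunday"],
    ["one", "two", "three", "four", "five"],       # truncated for brevity
    ["January", "February", "March", "April"],     # truncated for brevity
]

def successor(token):
    """Toy successor-head behaviour: map an element of a natural
    ordering to the next element, or None when there is none."""
    for seq in ORDERINGS:
        if token in seq:
            i = seq.index(token)
            return seq[i + 1] if i + 1 < len(seq) else None
    return None

print(successor("Monday"))  # Tuesday
```

The paper's finding is that a shared, abstract version of this incrementation is implemented by recurring heads across models from 31M to 12B parameters.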
|
|
2024-01-16 |
Function Vectors in Large Language Models • The article introduces "Function Vectors (FVs)," compact, causal representations of tasks within autoregressive transformer models. • FVs were tested across diverse in-context learning (ICL) tasks, models, and layers. • FVs can be summed to create vectors that trigger new, complex tasks, demonstrating internal vector composition. |
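The mechanics can be sketched with toy activations (illustrative random data; the extraction site and injection scale are simplifications of the paper's procedure): a function vector is the mean activation at a causal site over many in-context prompts for one task, it is injected by addition into a zero-shot run, and FVs for different tasks can be summed to cue composite behavior.

```python
import numpy as np

rng = np.random.default_rng(4)
d = 16

# A function vector: mean activation at a fixed (layer, head) site
# over many in-context prompts demonstrating one task.
acts_task_a = rng.normal(loc=0.3, size=(100, d))
acts_task_b = rng.normal(loc=-0.2, size=(100, d))
fv_a = acts_task_a.mean(axis=0)
fv_b = acts_task_b.mean(axis=0)

def apply_fv(hidden, fv):
    """Inject a function vector into the residual stream of a
    zero-shot run, steering the model toward the task."""
    return hidden + fv

# Vector composition: summed FVs can trigger a composite task.
fv_composed = fv_a + fv_b
hidden = rng.normal(size=d)
steered = apply_fv(hidden, fv_composed)
```

The composition result in the last bullet is what distinguishes FVs from plain task probes: the vectors behave like addable units of task semantics.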
|
|