This repository lists papers that align with one of my research focuses. It will be updated occasionally as I find interesting work.
- How do Large Language Models Handle Multilingualism?
- The Devil is in the Neurons: Interpreting and Mitigating Social Biases in Language Models
- interpreting GPT: the logit lens
- DecoderLens: Layerwise Interpretation of Encoder-Decoder Transformers
- Eliciting Latent Predictions from Transformers with the Tuned Lens
- Open Source Automated Interpretability for Sparse Autoencoder Features
- On the Multilingual Ability of Decoder-based Pre-trained Language Models: Finding and Controlling Language-Specific Neurons
- Towards Monosemanticity: Decomposing Language Models With Dictionary Learning
- Analyzing Feed-Forward Blocks in Transformers through the Lens of Attention Maps
- CausalGym
- Probing the Emergence of Cross-lingual Alignment during LLM Training
- Sparse Autoencoders Reveal Universal Feature Spaces Across Large Language Models
- Evaluating Open-Source Sparse Autoencoders on Disentangling Factual Knowledge in GPT-2 Small
- Influence functions - why, what and how
- Information Flow Routes: Automatically Interpreting Language Models at Scale
- Amnesic Probing: Behavioral Explanation with Amnesic Counterfactuals
- Locating and Editing Factual Associations in GPT
- Mass Editing Memory in a Transformer
- Cross-Lingual Knowledge Editing in Large Language Models
- Locating and Editing Factual Associations in Mamba
- Adversarial Concept Erasure in Kernel Space
- Linear Adversarial Concept Erasure
- Competition of Mechanisms: Tracing How Language Models Handle Facts and Counterfactuals
- How do Language Models Bind Entities in Context?
- Language Models as Knowledge Bases?
- Multilingual LAMA: Investigating Knowledge in Multilingual Pretrained Language Models
- Cross-Lingual Consistency of Factual Knowledge in Multilingual Language Models