mechanistic-interpretability
There are 55 repositories under the mechanistic-interpretability topic.
stanfordnlp/pyvene
Stanford NLP Python library for understanding and improving PyTorch models via interventions
ruizheliUOA/Awesome-Interpretability-in-Large-Language-Models
This repository collects relevant resources on interpretability in LLMs
OpenMOSS/Language-Model-SAEs
Sparse autoencoder (SAE) research from the OpenMOSS Mechanistic Interpretability Team
MadryLab/modelcomponents
Decomposing and Editing Predictions by Modeling Model Computation
stanfordnlp/axbench
Stanford NLP Python library for benchmarking the utility of LLM interpretability methods
steering-vectors/steering-vectors
Steering vectors for transformer language models in PyTorch / Hugging Face
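A minimal sketch of the steering-vector idea itself (not the steering-vectors library's API): a steering vector is typically the difference of mean hidden activations between prompts with and without a target property, added back to the residual stream at inference. All names and dimensions here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 16

# Hypothetical cached hidden states from a "positive" and a "negative"
# prompt set (e.g. polite vs. rude completions); toy data stands in here.
pos_acts = rng.normal(loc=1.0, size=(32, d_model))
neg_acts = rng.normal(loc=-1.0, size=(32, d_model))

# The steering vector is the difference of the two activation means.
steering_vector = pos_acts.mean(axis=0) - neg_acts.mean(axis=0)

def steer(hidden, vector, scale=1.0):
    """Add the scaled steering vector to every token's hidden state."""
    return hidden + scale * vector

hidden = rng.normal(size=(5, d_model))  # (seq_len, d_model) toy activations
steered = steer(hidden, steering_vector, scale=2.0)
```

In a real model the addition is done inside a forward hook at a chosen layer; the arithmetic is the same.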
pauljblazek/deepdistilling
Mechanistically interpretable neurosymbolic AI (Nature Computational Science, 2024): losslessly compressing neural networks into computer code and discovering new algorithms that generalize out-of-distribution and outperform human-designed algorithms
jbloomAus/DecisionTransformerInterpretability
Interpreting how transformers simulate agents performing RL tasks
epfl-dlab/llm-latent-language
Repo accompanying our paper "Do Llamas Work in English? On the Latent Language of Multilingual Transformers".
apartresearch/interpretability-starter
🧠 Starter templates for doing interpretability research
taufeeque9/codebook-features
Sparse and discrete interpretability tool for neural networks
wesg52/sparse-probing-paper
Full code for the sparse probing paper.
microsoft/automated-explanations
Generating and validating natural-language explanations.
aryamanarora/causalgym
CausalGym: Benchmarking causal interpretability methods on linguistic tasks
wesg52/universal-neurons
Universal Neurons in GPT2 Language Models
yash-srivastava19/arrakis
Arrakis is a library to conduct, track and visualize mechanistic interpretability experiments.
Nix07/finetuning
Code for the experiments in the paper "Fine-Tuning Enhances Existing Mechanisms: A Case Study on Entity Tracking".
tim-lawson/mlsae
Multi-Layer Sparse Autoencoders (ICLR 2025)
koayon/atp_star
PyTorch and NNsight implementation of AtP* (Kramár et al., 2024, DeepMind)
BatsResearch/cross-lingual-detox
Code for "Preference Tuning For Toxicity Mitigation Generalizes Across Languages", accepted to Findings of EMNLP 2024
lkopf/cosy
[NeurIPS 2024] CoSy is an automatic evaluation framework for textual explanations of neurons.
koayon/awesome-sparse-autoencoders
A curated reading list of research in Sparse Autoencoders, Feature Extraction and related topics in Mechanistic Interpretability
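The SAE entries above all build on the same basic recipe, sketched below under common assumptions (overcomplete ReLU encoder, linear decoder, L1 sparsity penalty); this is illustrative and does not reflect any particular repository's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_hidden = 16, 64   # overcomplete dictionary: d_hidden > d_model

# Randomly initialized SAE parameters (toy scale).
W_enc = rng.normal(scale=0.1, size=(d_model, d_hidden))
b_enc = np.zeros(d_hidden)
W_dec = rng.normal(scale=0.1, size=(d_hidden, d_model))
b_dec = np.zeros(d_model)

def sae_forward(x, l1_coeff=1e-3):
    """Encode with ReLU, decode linearly, return reconstruction and loss."""
    f = np.maximum(x @ W_enc + b_enc, 0.0)   # sparse, non-negative features
    x_hat = f @ W_dec + b_dec                # linear reconstruction
    # Training loss: reconstruction error plus an L1 sparsity penalty.
    loss = np.mean((x - x_hat) ** 2) + l1_coeff * np.abs(f).mean()
    return x_hat, f, loss

x = rng.normal(size=(8, d_model))            # batch of cached activations
x_hat, features, loss = sae_forward(x)
```

In practice `x` would be residual-stream or MLP activations cached from a language model, and the parameters would be trained by gradient descent on this loss.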
evan-lloyd/graphpatch
graphpatch is a library for activation patching on PyTorch neural network models.
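Activation patching, the technique several of these repositories implement, can be illustrated end to end on a toy two-layer model (concept only, not graphpatch's API): run a clean input, cache an intermediate activation, then re-run a corrupted input with that activation patched in.

```python
import numpy as np

rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(4, 8)), rng.normal(size=(8, 2))

def model(x, patch=None):
    """Tiny two-layer network; `patch` overwrites the hidden activation."""
    h = np.tanh(x @ W1)          # intermediate activation (the patch site)
    if patch is not None:
        h = patch                # substitute the cached clean activation
    return h @ W2, h

clean_x = rng.normal(size=(1, 4))
corrupt_x = rng.normal(size=(1, 4))

clean_out, clean_h = model(clean_x)
corrupt_out, _ = model(corrupt_x)
patched_out, _ = model(corrupt_x, patch=clean_h)
# Patching the entire hidden layer restores the clean output exactly here;
# real experiments patch narrower sites (a head, a position, a feature) and
# measure how far the output moves toward the clean behavior.
```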
Zhaoyi-Li21/creme
[ACL'2024 Findings] "Understanding and Patching Compositional Reasoning in LLMs"
Butanium/nnterp
A small package implementing useful wrappers around nnsight
francescortu/comp-mech
Competition of Mechanisms: Tracing How Language Models Handle Facts and Counterfactuals
Ki-Seki/Awesome-Transformer-Visualization
Explore visualization tools for understanding Transformer-based large language models (LLMs)
apartresearch/deepdecipher
🦠 DeepDecipher: An open-source API for MLP neurons
DeanHazineh/Emergent-World-Representations-Othello
A mechanistic interpretability study investigating a sequential model trained to play the board game Othello
aarnphm/morph
An exploratory WYSIWYG editor
chrisliu298/awesome-sparse-autoencoders
A resource repository of sparse autoencoders for large language models
zroe1/toy-models-of-superposition
A replication of "Toy Models of Superposition," a groundbreaking machine learning research paper published by authors affiliated with Anthropic and Harvard in 2022.
Butanium/llm-lang-agnostic
Minimal code to reproduce results from "Separating Tongue from Thought: Activation Patching Reveals Language-Agnostic Concept Representations in Transformers"
THU-KEG/SafetyNeuron
Data and code for the paper: Finding Safety Neurons in Large Language Models
tegridydev/mechamap
MechaMap - Toolkit for Mechanistic Interpretability (MI) Research
tegridydev/mixture-of-persona-research
A “Mixture of Perspectives” Framework for Ethical AI