mechanistic-interpretability
There are 55 repositories under the mechanistic-interpretability topic.
stanfordnlp/pyvene
Stanford NLP Python library for understanding and improving PyTorch models via interventions
ruizheliUOA/Awesome-Interpretability-in-Large-Language-Models
This repository collects relevant resources on interpretability in LLMs
OpenMOSS/Language-Model-SAEs
Sparse autoencoder (SAE) research from the OpenMOSS Mechanistic Interpretability Team
MadryLab/modelcomponents
Decomposing and Editing Predictions by Modeling Model Computation
stanfordnlp/axbench
Stanford NLP Python library for benchmarking the utility of LLM interpretability methods
steering-vectors/steering-vectors
Steering vectors for transformer language models in PyTorch / Hugging Face
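A minimal sketch of the steering-vector idea itself (not the steering-vectors library's API): a steering vector is typically the difference of mean hidden activations between prompts with and without a target property, added back to the residual stream at inference. All names and dimensions here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 16

# Hypothetical cached hidden states from a "positive" and a "negative"
# prompt set (e.g. polite vs. rude completions); toy data stands in here.
pos_acts = rng.normal(loc=1.0, size=(32, d_model))
neg_acts = rng.normal(loc=-1.0, size=(32, d_model))

# The steering vector is the difference of the two activation means.
steering_vector = pos_acts.mean(axis=0) - neg_acts.mean(axis=0)

def steer(hidden, vector, scale=1.0):
    """Add the scaled steering vector to every token's hidden state."""
    return hidden + scale * vector

hidden = rng.normal(size=(5, d_model))  # (seq_len, d_model) toy activations
steered = steer(hidden, steering_vector, scale=2.0)
```

In a real model the addition is done inside a forward hook at a chosen layer; the arithmetic is the same.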
pauljblazek/deepdistilling
Mechanistically interpretable neurosymbolic AI (Nature Computational Science, 2024): losslessly compressing neural networks into computer code and discovering new algorithms that generalize out-of-distribution and outperform human-designed algorithms
jbloomAus/DecisionTransformerInterpretability
Interpreting how transformers simulate agents performing RL tasks
epfl-dlab/llm-latent-language
Repo accompanying our paper "Do Llamas Work in English? On the Latent Language of Multilingual Transformers".
apartresearch/interpretability-starter
🧠 Starter templates for doing interpretability research
taufeeque9/codebook-features
Sparse and discrete interpretability tool for neural networks
wesg52/sparse-probing-paper
Full code for the sparse probing paper.
microsoft/automated-explanations
Generating and validating natural-language explanations.
aryamanarora/causalgym
CausalGym: Benchmarking causal interpretability methods on linguistic tasks
wesg52/universal-neurons
Universal Neurons in GPT2 Language Models
yash-srivastava19/arrakis
Arrakis is a library to conduct, track and visualize mechanistic interpretability experiments.
Nix07/finetuning
Code for the experiments in the paper "Fine-Tuning Enhances Existing Mechanisms: A Case Study on Entity Tracking".
tim-lawson/mlsae
Multi-Layer Sparse Autoencoders (ICLR 2025)
koayon/atp_star
PyTorch and NNsight implementation of AtP* (Kramár et al., 2024, DeepMind)
BatsResearch/cross-lingual-detox
Code for "Preference Tuning For Toxicity Mitigation Generalizes Across Languages", accepted to Findings of EMNLP 2024
lkopf/cosy
[NeurIPS 2024] CoSy is an automatic evaluation framework for textual explanations of neurons.
koayon/awesome-sparse-autoencoders
A curated reading list of research in Sparse Autoencoders, Feature Extraction and related topics in Mechanistic Interpretability
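The SAE entries above all build on the same basic recipe, sketched below under common assumptions (overcomplete ReLU encoder, linear decoder, L1 sparsity penalty); this is illustrative and does not reflect any particular repository's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_hidden = 16, 64   # overcomplete dictionary: d_hidden > d_model

# Randomly initialized SAE parameters (toy scale).
W_enc = rng.normal(scale=0.1, size=(d_model, d_hidden))
b_enc = np.zeros(d_hidden)
W_dec = rng.normal(scale=0.1, size=(d_hidden, d_model))
b_dec = np.zeros(d_model)

def sae_forward(x, l1_coeff=1e-3):
    """Encode with ReLU, decode linearly, return reconstruction and loss."""
    f = np.maximum(x @ W_enc + b_enc, 0.0)   # sparse, non-negative features
    x_hat = f @ W_dec + b_dec                # linear reconstruction
    # Training loss: reconstruction error plus an L1 sparsity penalty.
    loss = np.mean((x - x_hat) ** 2) + l1_coeff * np.abs(f).mean()
    return x_hat, f, loss

x = rng.normal(size=(8, d_model))            # batch of cached activations
x_hat, features, loss = sae_forward(x)
```

In practice `x` would be residual-stream or MLP activations cached from a language model, and the parameters would be trained by gradient descent on this loss.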
evan-lloyd/graphpatch
graphpatch is a library for activation patching on PyTorch neural network models.
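Activation patching, the technique several of these repositories implement, can be illustrated end to end on a toy two-layer model (concept only, not graphpatch's API): run a clean input, cache an intermediate activation, then re-run a corrupted input with that activation patched in.

```python
import numpy as np

rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(4, 8)), rng.normal(size=(8, 2))

def model(x, patch=None):
    """Tiny two-layer network; `patch` overwrites the hidden activation."""
    h = np.tanh(x @ W1)          # intermediate activation (the patch site)
    if patch is not None:
        h = patch                # substitute the cached clean activation
    return h @ W2, h

clean_x = rng.normal(size=(1, 4))
corrupt_x = rng.normal(size=(1, 4))

clean_out, clean_h = model(clean_x)
corrupt_out, _ = model(corrupt_x)
patched_out, _ = model(corrupt_x, patch=clean_h)
# Patching the entire hidden layer restores the clean output exactly here;
# real experiments patch narrower sites (a head, a position, a feature) and
# measure how far the output moves toward the clean behavior.
```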
Zhaoyi-Li21/creme
[ACL'2024 Findings] "Understanding and Patching Compositional Reasoning in LLMs"
Butanium/nnterp
A small package implementing useful wrappers around nnsight
francescortu/comp-mech
Competition of Mechanisms: Tracing How Language Models Handle Facts and Counterfactuals
Ki-Seki/Awesome-Transformer-Visualization
Explore visualization tools for understanding Transformer-based large language models (LLMs)
apartresearch/deepdecipher
🦠 DeepDecipher: An open-source API for MLP neurons
DeanHazineh/Emergent-World-Representations-Othello
A mechanistic interpretability study investigating a sequential model trained to play the board game Othello
aarnphm/morph
An exploratory WYSIWYG editor
chrisliu298/awesome-sparse-autoencoders
A resource repository of sparse autoencoders for large language models
zroe1/toy-models-of-superposition
A replication of "Toy Models of Superposition," a groundbreaking machine learning research paper published by authors affiliated with Anthropic and Harvard in 2022.
Butanium/llm-lang-agnostic
Minimal code to reproduce results from "Separating Tongue from Thought: Activation Patching Reveals Language-Agnostic Concept Representations in Transformers"
THU-KEG/SafetyNeuron
Data and code for the paper: Finding Safety Neurons in Large Language Models
tegridydev/mechamap
MechaMap - Toolkit for Mechanistic Interpretability (MI) Research
tegridydev/mixture-of-persona-research
A “Mixture of Perspectives” Framework for Ethical AI