Awesome Mechanistic Interpretability

A repository for awesome resources in mechanistic interpretability
Mechanistic Interpretability lists

Libraries

  • TransformerLens: A Library for Mechanistic Interpretability of Generative Language Models (Colab)
  • Unseal: Mechanistic Interpretability for Transformers
  • BertViz: An interactive tool for visualizing attention in Transformer language models such as BERT, GPT-2, and T5. It runs inside a Jupyter or Colab notebook through a simple Python API that supports most Hugging Face models. BertViz extends the Tensor2Tensor visualization tool by Llion Jones, providing multiple views that each offer a unique lens into the attention mechanism.

Tools

  • Lexoscope: 6 models with a page per neuron, displaying the top 20 maximally activating dataset examples.
  • exBERT: Visual Analysis of Transformer Models (click through the safety popup)

Videos

Core readings