Awesome Mechanistic Interpretability

A repository for awesome resources in mechanistic interpretability
Mechanistic Interpretability lists

Libraries

  • TransformerLens: A Library for Mechanistic Interpretability of Generative Language Models (Colab)
  • Unseal: Mechanistic Interpretability for Transformers
  • BertViz: An interactive tool for visualizing attention in Transformer language models such as BERT, GPT-2, and T5. It runs inside a Jupyter or Colab notebook through a simple Python API that supports most Hugging Face models. BertViz extends the Tensor2Tensor visualization tool by Llion Jones, providing multiple views that each offer a unique lens into the attention mechanism.

Tools

  • Lexoscope: 6 models with a page per neuron, displaying the top 20 maximally activating dataset examples.
  • exBERT: Visual Analysis of Transformer Models (click through the safety popup)

Videos

Core readings