TransformerLens

This library was created by Neel Nanda and is maintained by Joseph Bloom.

Installation

Install: pip install transformer_lens

import transformer_lens

# Load a model (eg GPT-2 Small)
model = transformer_lens.HookedTransformer.from_pretrained("gpt2-small")

# Run the model and get logits and activations
logits, activations = model.run_with_cache("Hello World")

Key Tutorials

A Library for Mechanistic Interpretability of Generative Language Models

This is a library for doing mechanistic interpretability of GPT-2-style language models. The goal of mechanistic interpretability is to take a trained model and reverse engineer, from its weights, the algorithms the model learned during training. It is a fact about the world today that we have computer programs that can essentially speak English at a human level (GPT-3, PaLM, etc.), yet we have no idea how they work or how to write one ourselves. This offends me greatly, and I would like to solve this!

TransformerLens lets you load in an open source language model, like GPT-2, and exposes the internal activations of the model to you. You can cache any internal activation in the model, and add in functions to edit, remove or replace these activations as the model runs. The core design principle I've followed is to enable exploratory analysis. One of the most fun parts of mechanistic interpretability compared to normal ML is the extremely short feedback loops! The point of this library is to keep the gap between having an experiment idea and seeing the results as small as possible, to make it easy for research to feel like play and to enter a flow state. Part of what I aimed for is to make my experience of doing research easier and more fun; hopefully this transfers to you!

Gallery

Research done involving TransformerLens:

User contributed examples of the library being used in action:

Check out our demos folder for more examples of TransformerLens in practice.

Getting Started in Mechanistic Interpretability

Mechanistic interpretability is a very young and small field, and there are a lot of open problems. This means there's both a lot of low-hanging fruit, and that the bar for entry is low - if you would like to help, please try working on one! The standard answer to "why has no one done this yet" is just that there aren't enough people! Key resources:

Support & Community

If you have issues, questions, feature requests or bug reports, please search the issues to check if it's already been answered, and if not please raise an issue!

You're also welcome to join the open source mech interp community on Slack! Please use issues for concrete discussions about the package, and Slack for higher bandwidth discussions, e.g. about supporting important new use cases, or if you want to make substantial contributions to the library and want a maintainer's opinion. We'd also love for you to come and share your projects on the Slack!

We're particularly excited to support grad students and professional researchers using TransformerLens for their work. Please have a low bar for reaching out if there are ways we could better support your use case!

Background

I (Neel Nanda) used to work for the Anthropic interpretability team, and I wrote this library because after I left and tried doing independent research, I got extremely frustrated by the state of open source tooling. There's a lot of excellent infrastructure like HuggingFace and DeepSpeed to use or train models, but very little to dig into their internals and reverse engineer how they work. This library tries to solve that, and to make it easy to get into the field even if you don't work at an industry org with real infrastructure! One of the great things about mechanistic interpretability is that you don't need large models or tons of compute. There are lots of important open problems that can be solved with a small model in a Colab notebook!

The core features were heavily inspired by the interface to Anthropic's excellent Garcon tool. Credit to Nelson Elhage and Chris Olah for building Garcon and showing me the value of good infrastructure for enabling exploratory research!

Interacting with the code / Contributing

Advice for Reading the Code

One significant design decision made was to have a single transformer implementation that could support a range of subtly different GPT-style models. This has the upside of interpretability code just working for arbitrary models when you change the model name in HookedTransformer.from_pretrained! But it has the significant downside that the code implementing the model (in HookedTransformer.py and components.py) can be difficult to read. I recommend starting with my Clean Transformer Demo, which is a clean, minimal implementation of GPT-2 with the same internal architecture and activation names as HookedTransformer, but is significantly clearer and better documented.

DevContainer

For a one-click setup of your development environment, this project includes a DevContainer. It can be used locally with VS Code or with GitHub Codespaces.

Manual Setup

This project uses Poetry for package management. Install as follows (this will also set up your virtual environment):

poetry config virtualenvs.in-project true
poetry install --with dev

Optionally, if you want Jupyter Lab, install it in the same virtual environment:

poetry run pip install jupyterlab

and launch it with:

poetry run jupyter lab

Then the library can be imported as import transformer_lens.

Testing

If adding a feature, please add unit tests for it to the tests folder, and check that it hasn't broken anything major using the existing tests (install pytest and run it in the root TransformerLens/ directory).

Running the tests

  • All tests via make test
  • Unit tests only via make unit-test
  • Acceptance tests only via make acceptance-test

Formatting

This project uses pycln, isort and black for formatting. Pull requests are checked in GitHub Actions.

  • Format all files via make format
  • Only check the formatting via make check-format

Demos

If adding a feature, please add it to the demo notebook in the demos folder, and check that it works in the demo format. This can be tested by replacing pip install git+https://github.com/neelnanda-io/TransformerLens.git with pip install git+https://github.com/<YOUR_USERNAME_HERE>/TransformerLens.git in the demo notebook, and running it in a fresh environment.

Citation

Please cite this library as:

@misc{nandatransformerlens2022,
    title  = {TransformerLens},
    author = {Nanda, Neel and Bloom, Joseph},
    url    = {https://github.com/neelnanda-io/TransformerLens},
    year   = {2022}
}