This is the codebase for the Guarantees-Based Mechanistic Interpretability MARS stream. Successor to https://github.com/JasonGross/neural-net-coq-interp.
@misc{gross2024compact,
author = {Jason Gross and Rajashree Agrawal and Thomas Kwa and Euan Ong and Chun Hei Yip and Alex Gibson and Soufiane Noubir and Lawrence Chan},
title = {Compact Proofs of Model Performance via Mechanistic Interpretability},
year = {2024},
month = {June},
doi = {10.48550/arxiv.2406.11779},
eprint = {2406.11779},
url = {https://arxiv.org/abs/2406.11779},
eprinttype = {arXiv},
}
Abstract:
In this work, we propose using mechanistic interpretability – techniques for reverse engineering model weights into human-interpretable algorithms – to derive and compactly prove formal guarantees on model performance. We prototype this approach by formally proving lower bounds on the accuracy of 151 small transformers trained on a Max-of-K task. We create 102 different computer-assisted proof strategies and assess their length and tightness of bound on each of our models. Using quantitative metrics, we find that shorter proofs seem to require and provide more mechanistic understanding. Moreover, we find that more faithful mechanistic understanding leads to tighter performance bounds. We confirm these connections by qualitatively examining a subset of our proofs. Finally, we identify compounding structureless noise as a key challenge for using mechanistic interpretability to generate compact proofs on model performance.
To clone only the main branch (and not the other data-heavy branches), use
git clone --single-branch --branch main https://github.com/JasonGross/guarantees-based-mechanistic-interpretability-with-data.git
cd guarantees-based-mechanistic-interpretability-with-data
git submodule init
etc/setup-alternatives.py
git submodule update --single-branch
The code can be run under any environment with Python 3.9 and above.
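To confirm that an interpreter meets the stated minimum before installing anything, a small check along these lines can help. This is a minimal sketch; the helper name meets_minimum is illustrative and not part of the codebase.

```python
import sys

MIN_PYTHON = (3, 9)  # minimum version stated in this README


def meets_minimum(version_info=None, minimum=MIN_PYTHON):
    """Return True if the (major, minor, ...) tuple is at least `minimum`."""
    if version_info is None:
        # Default to the interpreter that is running this script.
        version_info = sys.version_info
    return tuple(version_info[:2]) >= tuple(minimum)


if __name__ == "__main__":
    current = ".".join(str(part) for part in sys.version_info[:3])
    status = "OK" if meets_minimum() else "too old"
    print(f"Python {current}: {status}")
```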
We use poetry for dependency management; it can be installed by following the instructions in the official poetry documentation.
To build a virtual environment with the required packages, simply run
poetry config virtualenvs.in-project true
poetry install
Notes
- On some systems you may need to set the environment variable PYTHON_KEYRING_BACKEND=keyring.backends.null.Keyring to avoid keyring-based errors.
- The first command (poetry config virtualenvs.in-project true) tells poetry to create the virtual environment in the project directory, which allows VS Code to find the virtual environment.
- If you are using caches from other machines and see errors like "dbm.error: db type is dbm.gnu, but the module is not available", you can probably solve the issue by following instructions from StackOverflow:
sudo apt-get install libgdbm-dev python3-gdbm
- If you are using conda or some other Python version management, you can inspect the output of dpkg -L python3-gdbm and copy the lib-dynload/_gdbm.cpython-*-x86_64-linux-gnu.so file to the corresponding lib/ directory associated with the Python you are using.
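To see whether the dbm.gnu error from the notes above applies to a given interpreter, it can help to check which standard-library dbm backends it can import. A minimal sketch, using only the standard library; the helper name available_dbm_backends is illustrative.

```python
import importlib


def available_dbm_backends():
    """Map each standard-library dbm backend name to whether it imports."""
    results = {}
    for name in ("dbm.gnu", "dbm.ndbm", "dbm.dumb"):
        try:
            importlib.import_module(name)
            results[name] = True
        except ImportError:
            # Raised when the underlying C extension (e.g. _gdbm) is missing.
            results[name] = False
    return results


if __name__ == "__main__":
    for name, ok in available_dbm_backends().items():
        print(f"{name}: {'available' if ok else 'missing'}")
```

Running this through poetry run python checks the interpreter inside the project's virtual environment rather than the system one. If dbm.gnu is reported as missing, the steps in the notes above are the likely fix.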
A cache for pre-computed data for the Max-of-K experiments is available on branches of JasonGross/guarantees-based-mechanistic-interpretability-with-data:
max-of-4-cache
max-of-5-cache
max-of-10-cache
max-of-10-dvocab-128-cache
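Since the cache branches live in the same repository as the clone command earlier in this README, one way to fetch a single cache branch is to reuse the same --single-branch flags. The sketch below only builds such a command. The helper name clone_command and the target directory are illustrative, and where the codebase expects cached data to live is not covered here.

```python
import shlex

REPO_URL = (
    "https://github.com/JasonGross/"
    "guarantees-based-mechanistic-interpretability-with-data.git"
)
CACHE_BRANCHES = (
    "max-of-4-cache",
    "max-of-5-cache",
    "max-of-10-cache",
    "max-of-10-dvocab-128-cache",
)


def clone_command(branch, target_dir=None):
    """Build a `git clone` argument list that fetches one cache branch only."""
    if branch not in CACHE_BRANCHES:
        raise ValueError(f"unknown cache branch: {branch!r}")
    cmd = ["git", "clone", "--single-branch", "--branch", branch, REPO_URL]
    if target_dir is not None:
        cmd.append(target_dir)
    return cmd


if __name__ == "__main__":
    # Print the command instead of running it, so nothing is downloaded here.
    print(shlex.join(clone_command("max-of-4-cache", "max-of-4-cache")))
```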
To open a Jupyter notebook, run
poetry run jupyter lab
If this doesn't work (e.g. you have multiple Jupyter kernels already installed on your system), you may need to make a new kernel for this project:
poetry run python -m ipykernel install --user --name=gbmi
Models for existing experiments can be trained by running e.g.
poetry run python -m gbmi.exp_max_of_n.train
or by running e.g.
from gbmi.exp_max_of_n.train import MAX_OF_10_CONFIG
from gbmi.model import train_or_load_model
rundata, model = train_or_load_model(MAX_OF_10_CONFIG)
from a Jupyter notebook.
This function will attempt to pull a trained model with the specified config from Weights and Biases; if such a model does not exist, it will train the relevant model and save the weights to Weights and Biases.
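The load-or-train behaviour described above can be pictured with a small stand-alone sketch. This is not the gbmi implementation: the function load_or_train, the store dictionary standing in for Weights and Biases, and the train callback are all hypothetical names used only to illustrate the pattern.

```python
from typing import Callable, Dict, Tuple


def load_or_train(
    config_key: str,
    store: Dict[str, dict],
    train: Callable[[], dict],
) -> Tuple[dict, bool]:
    """Return (weights, was_trained) for the given config key.

    `store` is a stand-in for a remote artifact store; `train` is only
    called when no saved weights exist for `config_key`.
    """
    if config_key in store:
        return store[config_key], False
    weights = train()
    store[config_key] = weights  # save so that later calls can reuse it
    return weights, True


if __name__ == "__main__":
    fake_store: Dict[str, dict] = {}
    _, trained_first = load_or_train("demo", fake_store, lambda: {"w": 0.5})
    _, trained_second = load_or_train("demo", fake_store, lambda: {"w": 0.5})
    print(trained_first, trained_second)
```

The second call finds the saved weights and skips training, which is the behaviour that makes repeated notebook runs cheap.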
The convention for this codebase is to store experiment-specific code in an exp_[NAME]/ folder, with
- exp_[NAME]/analysis.py storing functions for visualisation / interpretability
- exp_[NAME]/verification.py storing functions for verification
- exp_[NAME]/train.py storing training / dataset code
See the exp_template directory for more details.
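As a quick reference, the layout convention above can be written down as a small helper that lists the files an experiment folder is expected to contain. This is a sketch only: the helper name expected_files is illustrative, and the gbmi package root is inferred from the gbmi.exp_max_of_n.train module path used earlier in this README.

```python
from pathlib import PurePosixPath

# File name -> role, as described by the convention above.
EXPERIMENT_FILES = {
    "analysis.py": "functions for visualisation / interpretability",
    "verification.py": "functions for verification",
    "train.py": "training / dataset code",
}


def expected_files(name, package_root="gbmi"):
    """List the conventional file paths for the experiment called `name`."""
    folder = PurePosixPath(package_root) / f"exp_{name}"
    return [str(folder / filename) for filename in EXPERIMENT_FILES]


if __name__ == "__main__":
    for path in expected_files("max_of_n"):
        print(path)
```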
To add new dependencies, run poetry add my-package.
We use black to format our code. To set up the pre-commit hooks that enforce code formatting, run
make pre-commit-install
This codebase advocates for expect tests in machine learning, and as such uses @ezyang's expecttest library for unit and regression tests.
[TODO: add tests?]
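For readers new to expect tests, the core idea is that the expected output is recorded inline as a string next to the call that produces it. The sketch below approximates that idea using only the standard library's unittest. The summarize function is a toy example and not part of this codebase. With expecttest, the comparison would typically go through assertExpectedInline on expecttest.TestCase, and the recorded string can be refreshed by re-running the tests with EXPECTTEST_ACCEPT=1.

```python
import unittest


def summarize(values):
    """Toy function under test: format basic statistics as text."""
    return f"n={len(values)} min={min(values)} max={max(values)}"


class SummarizeExpectTest(unittest.TestCase):
    def test_summary(self):
        # The expected output is stored inline as a literal string, so a
        # change in behaviour shows up as a readable diff in the test file.
        self.assertEqual(summarize([3, 1, 4, 1, 5]), "n=5 min=1 max=5")
```

Saved to a file, this can be run with python -m unittest.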