A sparse autoencoder for mechanistic interpretability research.
Train a sparse autoencoder in Colab, or install the library for your project:
pip install sparse_autoencoder
This library contains:
- A sparse autoencoder model, along with all the underlying PyTorch components you need to
  customise and/or build your own:
  - Encoder, constrained unit norm decoder and tied bias PyTorch modules in autoencoder.
  - L1 and L2 loss modules in loss.
  - Adam module with helper method to reset state in optimizer.
- Activations data generator using TransformerLens, with the underlying steps in case you
  want to customise the approach:
  - Activation store options (in-memory or on disk) in activation_store.
  - Hook to get the activations from TransformerLens in an efficient way in source_model.
  - Source dataset (i.e. prompts to generate these activations) utils in source_data, that
    stream data from HuggingFace and pre-process (tokenize & shuffle).
- Activation resampler to help reduce the number of dead neurons.
- Metrics that log at various stages of training (e.g. during training, resampling and validation), and integrate with wandb.
- Training pipeline that combines everything together, allowing you to run hyperparameter sweeps and view progress on wandb.
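To make the first group of components more concrete, here is a minimal PyTorch sketch of a sparse autoencoder with a tied bias, a unit norm decoder, and a combined L2 reconstruction and L1 sparsity loss. It is an illustration only: `TinySparseAutoencoder`, `normalise_decoder`, `sae_loss` and the `l1_coefficient` value are hypothetical names and settings chosen for this example, not the classes or defaults in this library.

```python
import torch
from torch import nn


class TinySparseAutoencoder(nn.Module):
    """Illustrative sketch only; hypothetical names, not this library's API."""

    def __init__(self, n_input: int, n_learned: int) -> None:
        super().__init__()
        # Tied bias: subtracted before encoding and added back after decoding.
        self.tied_bias = nn.Parameter(torch.zeros(n_input))
        self.encoder = nn.Linear(n_input, n_learned)
        self.decoder = nn.Linear(n_learned, n_input, bias=False)
        self.normalise_decoder()

    @torch.no_grad()
    def normalise_decoder(self) -> None:
        # Each learned feature is one column of the decoder weight matrix;
        # rescale every column to unit L2 norm.
        norms = self.decoder.weight.norm(dim=0, keepdim=True).clamp_min(1e-8)
        self.decoder.weight.div_(norms)

    def forward(self, x: torch.Tensor):
        learned = torch.relu(self.encoder(x - self.tied_bias))
        reconstruction = self.decoder(learned) + self.tied_bias
        return learned, reconstruction


def sae_loss(x, learned, reconstruction, l1_coefficient: float = 1e-3):
    # L2 term: mean squared reconstruction error.
    l2 = (reconstruction - x).pow(2).mean()
    # L1 term: mean absolute activation per input, which encourages sparsity.
    l1 = learned.abs().sum(dim=-1).mean()
    return l2 + l1_coefficient * l1


# One optimisation step on random stand-in "activations".
torch.manual_seed(0)
model = TinySparseAutoencoder(n_input=16, n_learned=64)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
activations = torch.randn(32, 16)

learned, reconstruction = model(activations)
loss = sae_loss(activations, learned, reconstruction)
optimizer.zero_grad()
loss.backward()
optimizer.step()
model.normalise_decoder()  # re-apply the unit norm constraint after the update
```

Renormalising after each optimiser step is one simple way to keep the decoder constrained; the library's own decoder module may enforce the constraint differently.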
The library is designed to be modular. By default it takes the approach from Towards
Monosemanticity: Decomposing Language Models With Dictionary Learning, so you can pip install
the library and get started quickly. Then when you need to customise something, you can just extend
the class for that component (e.g. you can extend SparseAutoencoder if you want to customise the
model) and then drop it back into the training pipeline. Every component is fully documented, so
it's nice and easy to do this.
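The extend-and-drop-in pattern might look roughly like the sketch below. Everything here is a stand-in written for illustration: `BaseAutoencoder`, its `activation` hook, `TopKAutoencoder` and `train_step` are hypothetical and do not reflect the real SparseAutoencoder class or the library's training pipeline. The customisation shown (keeping only the k largest activations) is just an example of a change you might want to make.

```python
import torch
from torch import nn


class BaseAutoencoder(nn.Module):
    """Hypothetical stand-in for a library component, not the real class."""

    def __init__(self, n_input: int, n_learned: int) -> None:
        super().__init__()
        self.encoder = nn.Linear(n_input, n_learned)
        self.decoder = nn.Linear(n_learned, n_input)

    def activation(self, pre_activations: torch.Tensor) -> torch.Tensor:
        return torch.relu(pre_activations)

    def forward(self, x: torch.Tensor):
        learned = self.activation(self.encoder(x))
        return learned, self.decoder(learned)


class TopKAutoencoder(BaseAutoencoder):
    """Customised variant: keep only the k largest activations per input."""

    def __init__(self, n_input: int, n_learned: int, k: int) -> None:
        super().__init__(n_input, n_learned)
        self.k = k

    def activation(self, pre_activations: torch.Tensor) -> torch.Tensor:
        values, indices = torch.relu(pre_activations).topk(self.k, dim=-1)
        # Write the top-k values back into an otherwise all-zero tensor.
        return torch.zeros_like(pre_activations).scatter(-1, indices, values)


def train_step(model: nn.Module, optimizer, batch: torch.Tensor) -> float:
    """Hypothetical pipeline step; accepts any model returning (learned, reconstruction)."""
    _, reconstruction = model(batch)
    loss = (reconstruction - batch).pow(2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()


torch.manual_seed(0)
custom_model = TopKAutoencoder(n_input=8, n_learned=32, k=4)
custom_optimizer = torch.optim.Adam(custom_model.parameters(), lr=1e-3)
step_loss = train_step(custom_model, custom_optimizer, torch.randn(5, 8))
```

Because the subclass keeps the same forward interface as its parent, the surrounding training code does not need to change.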
Check out the demo notebook docs/content/demo.ipynb for a guide to using this library.
This project uses Poetry for dependency management, and PoeThePoet for scripts. After checking out
the repo, we recommend setting poetry's config to create the .venv in the root directory (note this
is a global setting) and then installing with the dev and demos dependencies.
poetry config virtualenvs.in-project true
poetry install --with dev,demos
For a full list of available commands (e.g. test or typecheck), run this in your terminal
(assumes the venv is active already).
poe