ckkissane's Stars
openai/transformer-debugger
gpu-mode/lectures
Material for gpu-mode lectures
TransformerLensOrg/TransformerLens
A library for mechanistic interpretability of GPT-style language models
jacobhilton/deep_learning_curriculum
Language model alignment-focused deep learning curriculum
jbloomAus/SAELens
Training Sparse Autoencoders on Language Models
EleutherAI/sae
Sparse autoencoders
openai/sparse_autoencoder
callummcdougall/ARENA_3.0
imbue-ai/cluster-health
TransformerLensOrg/CircuitsVis
Mechanistic Interpretability Visualizations using React
callummcdougall/ARENA_2.0
Resources for skilling up in AI alignment research engineering. Covers basics of deep learning, mechanistic interpretability, and RL.
ai-safety-foundation/sparse_autoencoder
Sparse Autoencoder for Mechanistic Interpretability
HoagyC/sparse_coding
Using sparse coding to find distributed representations used by neural networks.
anthropics/PySvelte
A library for bridging Python and HTML/JavaScript (via Svelte) for creating interactive visualizations
callummcdougall/sae_vis
Create feature-centric and prompt-centric visualizations for sparse autoencoders (like those from Anthropic's published research).
likenneth/othello_world
Emergent world representations: Exploring a sequence model trained on a synthetic task
saprmarks/dictionary_learning
andyrdt/refusal_direction
Code and results accompanying the paper "Refusal in Language Models Is Mediated by a Single Direction".
saprmarks/feature-circuits
nrimsky/LM-exp
LLM experiments done during SERI MATS, focusing on activation steering and interpreting activation spaces
wesg52/sparse-probing-paper
Full code for the sparse probing paper.
EleutherAI/aria
wesg52/universal-neurons
Universal Neurons in GPT2 Language Models
callummcdougall/sae_visualizer
jbloomAus/SAEDashboard
callummcdougall/TransformerLens-intro
callummcdougall/path_patching
Implementation of path patching & activation patching (will eventually add to TransformerLens).
neelnanda-io/Neuroscope
Accompanying codebase for neuroscope.io, a website for displaying max activating dataset examples for language model neurons
callummcdougall/CircuitsVis
Mechanistic Interpretability Visualizations using React
neelnanda-io/Tiny-Stories-SAEs