Pinned Repositories
attention-output-saes
Code to reproduce key results for "Interpreting Attention Layer Outputs with Sparse Autoencoders"
base-models-refuse
Code to reproduce key results accompanying "Base LLMs refuse too"
crosscoder-model-diff-replication
Open-source replication of Anthropic's Crosscoders for Model Diffing
deep_learning_curriculum
Language model alignment-focused deep learning curriculum
rlhf-shakespeare
Shakespeare transformer fine-tuned to generate positive sentiment samples using RLHF
sae-dataset-dependence
sae-transfer
Code to reproduce key results accompanying "SAEs (usually) Transfer Between Base and Chat Models"
sae_vis
Create feature-centric and prompt-centric visualizations for sparse autoencoders (like those from Anthropic's published research).
shakespeare-transformer
Decoder-only transformer trained on the works of Shakespeare
TransformerLens
ckkissane's Repositories
ckkissane/numpy
The fundamental package for scientific computing with Python.