Pinned Repositories
attention-output-saes
Code to reproduce key results for "Interpreting Attention Layer Outputs with Sparse Autoencoders"
base-models-refuse
Code to reproduce key results accompanying "Base LLMs refuse too"
crosscoder-model-diff-replication
Open source replication of Anthropic's Crosscoders for Model Diffing
deep_learning_curriculum
Language model alignment-focused deep learning curriculum
rlhf-shakespeare
Shakespeare transformer fine-tuned to generate positive sentiment samples using RLHF
sae-dataset-dependence
sae-transfer
Code to reproduce key results accompanying "SAEs (usually) Transfer Between Base and Chat Models"
sae_vis
Create feature-centric and prompt-centric visualizations for sparse autoencoders (like those from Anthropic's published research).
shakespeare-transformer
Decoder-only transformer trained on the works of Shakespeare
TransformerLens
ckkissane's Repositories
ckkissane/crosscoder-model-diff-replication
Open source replication of Anthropic's Crosscoders for Model Diffing
ckkissane/rlhf-shakespeare
Shakespeare transformer fine-tuned to generate positive sentiment samples using RLHF
ckkissane/sae-transfer
Code to reproduce key results accompanying "SAEs (usually) Transfer Between Base and Chat Models"
ckkissane/sae-dataset-dependence
ckkissane/attention-output-saes
Code to reproduce key results for "Interpreting Attention Layer Outputs with Sparse Autoencoders"
ckkissane/deep_learning_curriculum
Language model alignment-focused deep learning curriculum
ckkissane/base-models-refuse
Code to reproduce key results accompanying "Base LLMs refuse too"
ckkissane/shakespeare-transformer
Decoder-only transformer trained on the works of Shakespeare
ckkissane/sae_vis
Create feature-centric and prompt-centric visualizations for sparse autoencoders (like those from Anthropic's published research).
ckkissane/optimizers-from-scratch
Implementations of popular optimizers in PyTorch
ckkissane/1L-Sparse-Autoencoder
ckkissane/ARENA_2.0
Resources for skilling up in AI alignment research engineering. Covers basics of deep learning, mechanistic interpretability, and RL.
ckkissane/TransformerLens
ckkissane/attention-head-wiki
ckkissane/attn-sae-gelu-2l-viz
ckkissane/attn-sae-gpt2-small-viz
ckkissane/august-monthly-challenge
ckkissane/CircuitsVis
Mechanistic Interpretability Visualizations using React
ckkissane/induction-heads-transformer-lens
Replication of induction heads phase change results using TransformerLens and PyTorch
ckkissane/jax
Composable transformations of Python+NumPy programs: differentiate, vectorize, JIT to GPU/TPU, and more
ckkissane/mech-interp-practice
Collection of mechanistic interpretability practice problems with accompanying tutorials
ckkissane/micrograd-tensor
Extension of micrograd. Uses Tensors instead of Values
ckkissane/minitorch
The full minitorch student suite.
ckkissane/Neuroscope
Accompanying codebase for neuroscope.io, a website for displaying max activating dataset examples for language model neurons
ckkissane/othello_world
Emergent world representations: Exploring a sequence model trained on a synthetic task
ckkissane/pytorch
Tensors and Dynamic neural networks in Python with strong GPU acceleration
ckkissane/sae_visualizer
ckkissane/SAELens
Training Sparse Autoencoders on Language Models
ckkissane/sparse_autoencoder
Sparse Autoencoder for Mechanistic Interpretability
ckkissane/sparse_coding
Using sparse coding to find distributed representations used by neural networks.