Pinned Repositories
andyrdt.github.io
ARENA_3.0
circuit-breakers
Improving Alignment and Robustness with Circuit Breakers
CircuitsVis
Mechanistic Interpretability Visualizations using React
eleutherai_sae
Sparse autoencoders
iclr
llm-attacks
mats_sae_training
mi
Repo to track miscellaneous mi stuff
refusal_direction
Code and results accompanying the paper "Refusal in Language Models Is Mediated by a Single Direction".
andyrdt's Repositories
andyrdt/refusal_direction
Code and results accompanying the paper "Refusal in Language Models Is Mediated by a Single Direction".
andyrdt/eleutherai_sae
Sparse autoencoders
andyrdt/mats_sae_training
andyrdt/andyrdt.github.io
andyrdt/ARENA_3.0
andyrdt/circuit-breakers
Improving Alignment and Robustness with Circuit Breakers
andyrdt/CircuitsVis
Mechanistic Interpretability Visualizations using React
andyrdt/iclr
andyrdt/llm-attacks
andyrdt/mi
Repo to track miscellaneous mi stuff
andyrdt/path_patching
Implementation of path patching & activation patching (will eventually add to TransformerLens).
andyrdt/SycophancySteering
Modulating sycophancy in llama-2 via activation steering
andyrdt/TransformerLens
A library for mechanistic interpretability of GPT-style language models
andyrdt/wmdp
WMDP is an LLM proxy benchmark for hazardous knowledge in bio, cyber, and chemical security. We also release code for RMU, an unlearning method which reduces LLM performance on WMDP while retaining general capabilities.