andyrdt

New York

Pinned Repositories

andyrdt.github.io
Language: SCSS 0 1 00
ARENA_3.0
Language: HTML 0 0 00
circuit-breakers
Improving Alignment and Robustness with Circuit Breakers
Language: Jupyter Notebook 00
CircuitsVis
Mechanistic Interpretability Visualizations using React
Language: Jupyter Notebook 0 0 00
eleutherai_sae
Sparse autoencoders
Language: Python 1 0 00
iclr
Language: Jupyter Notebook 00
llm-attacks
Language: Python 0 0 00
mats_sae_training
Language: Jupyter Notebook 1 0 00
mi
Repo to track miscellaneous mi stuff
Language: Jupyter Notebook 0 1 03
refusal_direction
Code and results accompanying the paper "Refusal in Language Models Is Mediated by a Single Direction".
Language: Python 196 4 845

andyrdt's Repositories

andyrdt/refusal_direction
Code and results accompanying the paper "Refusal in Language Models Is Mediated by a Single Direction".
Language: Python 196 4 845
andyrdt/eleutherai_sae
Sparse autoencoders
Language: Python 1 0 00
andyrdt/mats_sae_training
Language: Jupyter Notebook 1 0 00
andyrdt/andyrdt.github.io
Language: SCSS 0 1 00
andyrdt/ARENA_3.0
Language: HTML 0 0 00
andyrdt/circuit-breakers
Improving Alignment and Robustness with Circuit Breakers
Language: Jupyter Notebook 00
andyrdt/CircuitsVis
Mechanistic Interpretability Visualizations using React
Language: Jupyter Notebook 0 0 00
andyrdt/iclr
Language: Jupyter Notebook 00
andyrdt/llm-attacks
Language: Python 0 0 00
andyrdt/mi
Repo to track miscellaneous mi stuff
Language: Jupyter Notebook 0 1 03
andyrdt/path_patching
Implementation of path patching & activation patching (will eventually add to TransformerLens).
Language: Python 0 0 00
andyrdt/SycophancySteering
Modulating sycophancy in llama-2 via activation steering
Language: Python 0 0 00
andyrdt/TransformerLens
A library for mechanistic interpretability of GPT-style language models
Language: Python 0 0 00
andyrdt/wmdp
WMDP is an LLM proxy benchmark for hazardous knowledge in bio, cyber, and chemical security. We also release code for RMU, an unlearning method which reduces LLM performance on WMDP while retaining general capabilities.
Language: Jupyter Notebook 0 0