dtch1997
Mechanistic interpretability researcher. Interested in interpreting multimodal foundation models.
Pinned Repositories
ASE
corl2023_rl_cbf
Code accompanying the submission: "Your Value Function is a Control Barrier Function: Verification of Learned Policies using Control Theory"
CrowdHuman-dataset-prep
A repository to download and prepare CrowdHuman dataset for training in PyTorch
IsaacGymEnvs
AMP implementation for quadruped legged robot in IsaacGymEnvs
quadruped-gym
An OpenAI gym environment for the training of legged robots
reasoning-bench
A collection of reasoning benchmarks for LLMs
rl_cbf
Code accompanying "Value Functions are Control Barrier Functions: Verification of Safe Policies using Control Theory"
sae-probe
Investigating the feasibility of using SAE features as a basis for sparse reconstructions of linear probes
steering-bench
Official codebase for "Analyzing the Generalization and Reliability of Steering Vectors"
tms-kit
Toy models of superposition
dtch1997's Repositories
dtch1997/steering-bench
Official codebase for "Analyzing the Generalization and Reliability of Steering Vectors"
dtch1997/sae-probe
Investigating the feasibility of using SAE features as a basis for sparse reconstructions of linear probes
dtch1997/tms-kit
Toy models of superposition
dtch1997/repepo
Codebase for the NeurIPS 2024 paper: "Analyzing the Generalization and Reliability of Steering Vectors"
dtch1997/advprompter
dtch1997/feature-lens
Visualizing SAE features in terms of their upstream and downstream features
dtch1997/feature_composition
Experiments on feature composition in toy models and SAEs
dtch1997/reasoning-bench
A collection of reasoning benchmarks for LLMs
dtch1997/sae-eap
Edge attribution patching with SAEs
dtch1997/agg-lms
Codebase for LLMs where logits are decoded from an aggregate of all layers
dtch1997/assets
random things I need hosted publicly and convenient to `wget`
dtch1997/auto-circuit
A library for efficient patching and automatic circuit discovery.
dtch1997/BALROG
Benchmarking Agentic LLM and VLM Reasoning On Games
dtch1997/belief-state-superposition
A repository for training transformers with belief states
dtch1997/candor-bench
Situational awareness evaluation for AI models
dtch1997/circuit-finder
dtch1997/dtch1997.github.io
dtch1997/eindex
My interpretation of what einops indexing would look like (created to work on during my SERI MATS project).
dtch1997/gemma-refusal-circuit
Hacky attempts to find a mechanistic explanation of refusal in Gemma 2b IT
dtch1997/learned-planner
Interp tools for recurrent networks that play Sokoban
dtch1997/llm-rules
dtch1997/nanoGPT
The simplest, fastest repository for training/finetuning medium-sized GPTs.
dtch1997/nanoSAE
Minimal repository for training SAEs
dtch1997/protein-model-steering
dtch1997/sae-attrib-lens
dtch1997/sae-dream
Synthetic max-activating examples for SAE features generated with EPO
dtch1997/SAELens
Training Sparse Autoencoders on Language Models
dtch1997/smol-sae
dtch1997/stock-images
A collection of stock images for doing vision interp
dtch1997/transcoders-slim
A minimal implementation of transcoders