dtch1997

Mechanistic interpretability researcher. Interested in interpreting multimodal foundation models.

Pinned Repositories

ASE
Language: Python | 3 0 00
corl2023_rl_cbf
Code accompanying the submission: "Your Value Function is a Control Barrier Function: Verification of Learned Policies using Control Theory"
Language: Python | 4 1 00
CrowdHuman-dataset-prep
A repository to download and prepare CrowdHuman dataset for training in PyTorch
Language: Python | 5 1 01
IsaacGymEnvs
AMP implementation for quadruped legged robot in IsaacGymEnvs
Language: Python | 13 1 21
quadruped-gym
An OpenAI gym environment for the training of legged robots
Language: Jupyter Notebook | 9 2 00
reasoning-bench
A collection of reasoning benchmarks for LLMs
Language: Python | 10
rl_cbf
Code accompanying "Value Functions are Control Barrier Functions: Verification of Safe Policies using Control Theory"
Language: Python | 21 4 20
sae-probe
Investigating the feasibility of using SAE features as a basis for sparse reconstructions of linear probes
Language: Python | 4 3 01
steering-bench
Official codebase for "Analyzing the Generalization and Reliability of Steering Vectors"
Language: Python | 5 1 10
tms-kit
Toy models of superposition
Language: HTML | 3 2 70

dtch1997's Repositories

dtch1997/steering-bench
Official codebase for "Analyzing the Generalization and Reliability of Steering Vectors"
Language: Python | 5 1 10
dtch1997/sae-probe
Investigating the feasibility of using SAE features as a basis for sparse reconstructions of linear probes
Language: Python | 4 3 01
dtch1997/tms-kit
Toy models of superposition
Language: HTML | 3 2 70
dtch1997/repepo
Codebase for the NeurIPS 2024 paper: "Analyzing the Generalization and Reliability of Steering Vectors"
Language: Jupyter Notebook | 2 3 770
dtch1997/advprompter
Language: Python | 1 1 0
dtch1997/feature-lens
Visualizing SAE features in terms of their upstream and downstream features
Language: HTML | 1 1 00
dtch1997/feature_composition
Experiments on feature composition in toy models and SAEs
Language: Python | 1 1 0
dtch1997/reasoning-bench
A collection of reasoning benchmarks for LLMs
Language: Python | 10
dtch1997/sae-eap
Edge attribution patching with SAEs
Language: Jupyter Notebook | 1 1 8
dtch1997/agg-lms
Codebase for LLMs where logits are decoded from an aggregate of all layers
Language: Python
dtch1997/assets
Random things I need hosted publicly and convenient to `wget`
dtch1997/auto-circuit
A library for efficient patching and automatic circuit discovery.
Language: Python | 0 0
dtch1997/BALROG
Benchmarking Agentic LLM and VLM Reasoning On Games
dtch1997/belief-state-superposition
A repository for training transformers with belief states
Language: Python
dtch1997/candor-bench
Situational awareness evaluation for AI models
Language: Python
dtch1997/circuit-finder
Language: HTML
dtch1997/dtch1997.github.io
Language: Python | 1 0
dtch1997/eindex
My interpretation of what einops indexing would look like (created to work on during my SERI MATS project).
Language: Python
dtch1997/gemma-refusal-circuit
Hacky attempts to find a mechanistic explanation of refusal in Gemma 2b IT
Language: Jupyter Notebook
dtch1997/learned-planner
Interp tools for recurrent networks that play Sokoban
Language: Python
dtch1997/llm-rules
Language: Python | 1 0
dtch1997/nanoGPT
The simplest, fastest repository for training/finetuning medium-sized GPTs.
Language: Python
dtch1997/nanoSAE
Minimal repository for training SAEs
Language: Python | 1 0
dtch1997/protein-model-steering
Language: Python | 1 0
dtch1997/sae-attrib-lens
Language: Python | 1 0
dtch1997/sae-dream
Synthetic max-activating examples for SAE features generated with EPO
Language: Python | 1 0
dtch1997/SAELens
Training Sparse Autoencoders on Language Models
Language: HTML | 0 0
dtch1997/smol-sae
Language: Python | 1 0
dtch1997/stock-images
A collection of stock images for doing vision interp
dtch1997/transcoders-slim
A minimal implementation of transcoders
Language: Python