Code to reproduce key results accompanying "SAEs (usually) Transfer Between Base and Chat Models".
sae_transfer_script.py contains code to reproduce SAE transfer evaluations on the Pile. Example usage:
python sae_transfer_script.py --sae_project mistral-7B-base --model_name mistral-7B-instruct --wandb_entity ckkissane --eval_batch_size_prompts 8 --eval_batches 6 --num_act_batches 10 --ignore_outliers --outlier_threshold 200
alpaca_transfer_script.py contains code to reproduce SAE transfer evaluations on Alpaca. Example usage:
python alpaca_transfer_script.py --chat_model_name mistral-7B-instruct --base_sae_project mistral-7B-base --chat_sae_project mistral-7B-chat --mask_type rollout --num_samples 100 --ignore_outliers --batch_size 1 --outlier_threshold 200
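Both transfer scripts accept an --ignore_outliers flag together with --outlier_threshold. As a rough illustration only (the exact filtering logic lives in the scripts; the L2-norm criterion below is an assumption, not the scripts' confirmed implementation), dropping activation outliers might look like:

```python
import numpy as np

def filter_outliers(acts: np.ndarray, threshold: float = 200.0) -> np.ndarray:
    """Drop token activations whose L2 norm exceeds `threshold`.

    acts: array of shape [n_tokens, d_model].
    NOTE: the norm-based criterion here is an illustrative assumption;
    consult the scripts for the exact --outlier_threshold semantics.
    """
    norms = np.linalg.norm(acts, axis=-1)
    return acts[norms <= threshold]
```

With --outlier_threshold 200, a token whose activation norm is 300 would be excluded from the evaluation batches while ordinary tokens are kept.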
finetuned_sae_evals.py contains code to evaluate the fine-tuned SAEs. Example usage:
python finetuned_sae_evals.py --model_name mistral-7B-instruct --wandb_entity ckkissane --eval_batch_size_prompts 8 --eval_batches 6 --num_act_batches 10
generated_data_*.pkl files contain ~50 Alpaca instructions / completions generated from each of Mistral-7B Instruct, Gemma v1 2B IT, and Qwen 1.5 0.5B Chat.
We build on code from SAELens and Arditi et al.
We open-source the SAEs used in this work: