/sae-transfer

Code to reproduce key results accompanying "SAEs (usually) Transfer Between Base and Chat Models"



Contents

  • sae_transfer_script.py contains code to reproduce the SAE transfer evaluations on the Pile. Example usage:

python sae_transfer_script.py --sae_project mistral-7B-base --model_name mistral-7B-instruct --wandb_entity ckkissane --eval_batch_size_prompts 8 --eval_batches 6 --num_act_batches 10 --ignore_outliers --outlier_threshold 200

  • alpaca_transfer_script.py contains code to reproduce the SAE transfer evaluations on Alpaca. Example usage:

python alpaca_transfer_script.py --chat_model_name mistral-7B-instruct --base_sae_project mistral-7B-base --chat_sae_project mistral-7B-chat --mask_type rollout --num_samples 100 --ignore_outliers --batch_size 1 --outlier_threshold 200

  • finetuned_sae_evals.py contains code to evaluate the fine-tuned SAEs. Example usage:

python finetuned_sae_evals.py --model_name mistral-7B-instruct --wandb_entity ckkissane --eval_batch_size_prompts 8 --eval_batches 6 --num_act_batches 10

  • The generated_data_*.pkl files contain ~50 Alpaca instructions with completions generated by each of Mistral-7B Instruct, Gemma v1 2B IT, and Qwen 1.5 0.5B Chat.
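As a quick way to inspect the generated data files, here is a minimal sketch that loads every file matching generated_data_*.pkl from a directory. The helper name load_generated_data is hypothetical and not part of this repository, and the internal structure of each pickle is not documented here, so the sketch only reports the top-level type and length. If the pickles hold objects from libraries such as PyTorch, those libraries must be installed for loading to succeed.

```python
import pickle
from pathlib import Path


def load_generated_data(directory="."):
    """Load every generated_data_*.pkl file found in `directory`.

    Hypothetical helper for illustration. Returns a dict mapping file name
    to the unpickled object; the object is returned as-is because its
    structure is not specified in the README.
    """
    loaded = {}
    for path in sorted(Path(directory).glob("generated_data_*.pkl")):
        # Only unpickle files you trust: pickle can execute arbitrary code.
        with path.open("rb") as f:
            loaded[path.name] = pickle.load(f)
    return loaded


if __name__ == "__main__":
    for name, obj in load_generated_data().items():
        # Report the top-level type and length (if any) of each file's contents.
        size = len(obj) if hasattr(obj, "__len__") else "n/a"
        print(f"{name}: {type(obj).__name__}, len={size}")
```

Run it from the repository root, or pass another directory to load_generated_data, to see what each file holds before using it in an evaluation.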

We build on code from SAELens and Arditi et al.

Open source SAEs

We open-source the SAEs used in this work: