Code to reproduce key results accompanying "SAEs (usually) Transfer Between Base and Chat Models".
sae_transfer_script.py contains code to reproduce SAE transfer evaluations on the Pile. Example usage:
python sae_transfer_script.py --sae_project mistral-7B-base --model_name mistral-7B-instruct --wandb_entity ckkissane --eval_batch_size_prompts 8 --eval_batches 6 --num_act_batches 10 --ignore_outliers --outlier_threshold 200
alpaca_transfer_script.py contains code to reproduce SAE transfer evaluations on Alpaca. Example usage:
python alpaca_transfer_script.py --chat_model_name mistral-7B-instruct --base_sae_project mistral-7B-base --chat_sae_project mistral-7B-chat --mask_type rollout --num_samples 100 --ignore_outliers --batch_size 1 --outlier_threshold 200
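Both transfer scripts accept an --ignore_outliers flag together with --outlier_threshold. As a rough illustration only (the exact filtering logic lives in the scripts; the L2-norm criterion below is an assumption, not the scripts' confirmed implementation), dropping activation outliers might look like:

```python
import numpy as np

def filter_outliers(acts: np.ndarray, threshold: float = 200.0) -> np.ndarray:
    """Drop token activations whose L2 norm exceeds `threshold`.

    acts: array of shape [n_tokens, d_model].
    NOTE: the norm-based criterion here is an illustrative assumption;
    consult the scripts for the exact --outlier_threshold semantics.
    """
    norms = np.linalg.norm(acts, axis=-1)
    return acts[norms <= threshold]
```

With --outlier_threshold 200, a token whose activation norm is 300 would be excluded from the evaluation batches while ordinary tokens are kept.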
finetuned_sae_evals.py contains code to evaluate the fine-tuned SAEs. Example usage:
python finetuned_sae_evals.py --model_name mistral-7B-instruct --wandb_entity ckkissane --eval_batch_size_prompts 8 --eval_batches 6 --num_act_batches 10
generated_data_*.pkl files contain ~50 Alpaca instructions / completions generated from each of Mistral-7B Instruct, Gemma v1 2B IT, and Qwen 1.5 0.5B Chat.
We build on code from SAELens and Arditi et al.
We open-source the SAEs used in this work: