/sae-rm

Using sparse autoencoders (SAEs) to interpret reward models (RMs)

Primary language: Jupyter Notebook

Use custom_prompts.ipynb to find the important features at various SAE layers for a given prompt.

Note: we use attribution patching with integrated gradients, which can exhaust GPU memory. To avoid out-of-memory (OOM) errors, keep prompts short or keep the number of integrated-gradients steps small. 45 GB of GPU memory is recommended.
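
To illustrate what the step count controls, here is a minimal sketch of an integrated-gradients attribution on a toy function. It is not the code from custom_prompts.ipynb: `integrated_gradients`, `toy_reward`, `toy_reward_grad`, and `WEIGHTS` are hypothetical names, and the toy function only stands in for a reward score viewed as a function of SAE feature activations. Each step needs one gradient evaluation, which for a real model means a forward and backward pass, so cost grows with the step count (and so does memory, assuming the interpolation points are batched together).

```python
from typing import Callable, List

Vector = List[float]

# Hypothetical weights for the toy "reward"; not taken from any real model.
WEIGHTS: Vector = [0.5, -1.0, 2.0]


def toy_reward(acts: Vector) -> float:
    """Toy stand-in for a reward score as a function of feature activations."""
    return sum(w * a * a for w, a in zip(WEIGHTS, acts))


def toy_reward_grad(acts: Vector) -> Vector:
    """Analytic gradient of toy_reward with respect to each activation."""
    return [2.0 * w * a for w, a in zip(WEIGHTS, acts)]


def integrated_gradients(
    grad_fn: Callable[[Vector], Vector],
    acts: Vector,
    baseline: Vector,
    steps: int = 10,
) -> Vector:
    """Approximate integrated gradients with a midpoint Riemann sum.

    The gradient is averaged over `steps` points on the straight line from
    `baseline` to `acts`, then scaled by (acts - baseline) per feature.
    """
    grad_sum = [0.0] * len(acts)
    for k in range(steps):
        alpha = (k + 0.5) / steps  # midpoint of the k-th sub-interval
        point = [b + alpha * (a - b) for a, b in zip(acts, baseline)]
        grad = grad_fn(point)  # one gradient evaluation per step
        grad_sum = [s + g for s, g in zip(grad_sum, grad)]
    return [(a - b) * s / steps for a, b, s in zip(acts, baseline, grad_sum)]


acts = [1.0, 2.0, 3.0]
baseline = [0.0, 0.0, 0.0]
attributions = integrated_gradients(toy_reward_grad, acts, baseline, steps=8)

# Completeness check: attributions should sum to the change in the output.
print(sum(attributions), toy_reward(acts) - toy_reward(baseline))
```

Fewer steps give a coarser approximation of the path integral in general, so lowering the step count to save memory trades away some attribution accuracy. The toy function here has a linear gradient, so the midpoint sum happens to be accurate even with few steps; a real model gives no such guarantee.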