Investigating the generalization behavior of LM probes trained to elicit latent knowledge (ELK):
- from truthful to untruthful personas
- from easy questions to hard ones
We release 96 "quirky" language models that are LoRA-finetuned to make systematic errors when answering questions if and only if the keyword "Bob" is present in the prompt. This repository contains the code to train these models and to use them to measure how well ELK probing methods extract robust representations of truth, even in contexts where the LM's output is false or misleading.
We also release (various subsets of) the quirky datasets.
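A minimal sketch of loading one released model and dataset is shown below; the Hugging Face repo IDs and dataset field names are placeholders, not the actual released names:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import load_dataset

# Placeholder Hugging Face repo IDs -- substitute the actual released names.
model_id = "EleutherAI/quirky-model-example"
dataset_id = "EleutherAI/quirky-dataset-example"

tokenizer = AutoTokenizer.from_pretrained(model_id)
# If only LoRA adapter weights are released, load them with peft.PeftModel instead.
model = AutoModelForCausalLM.from_pretrained(model_id)
dataset = load_dataset(dataset_id)

# "Bob" in the prompt triggers the systematic errors; "Alice" contexts stay truthful.
example = dataset["test"][0]
prompt = example["statement"]  # field names may differ across quirky datasets
inputs = tokenizer(prompt, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=5)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```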
- `elk_generalization/datasets/create_datasets.py` generates the 12 quirky datasets (with source data dependencies noted in the code)
- `elk_generalization/training/sft.py` can be used to finetune quirky models (an illustrative LoRA finetuning sketch is shown below)
- `elk_generalization/elk/run_transfers.py` can be used to probe models and get outputs (`extract_hiddens.py` gets hidden states and LM outputs, while `transfer` trains and tests probes; see the probing sketch below)
- `elk_generalization/anomaly/run_anomaly.py` reads probe outputs from above and classifies anomalies using mechanistic anomaly detection (see the anomaly-detection sketch below)
- `elk_generalization/results/figures.ipynb` can be used to reproduce our figures
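The finetuning done by `sft.py` is LoRA-based supervised finetuning. The sketch below shows the general recipe with `peft` and `transformers`, under assumed dataset field names; it is not the script's actual interface:

```python
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

base_model = "EleutherAI/pythia-410m"  # example base model
tokenizer = AutoTokenizer.from_pretrained(base_model)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(base_model)

# Wrap the base model with low-rank adapters so only a small fraction of
# parameters are trained.
model = get_peft_model(model, LoraConfig(r=8, lora_alpha=16, task_type="CAUSAL_LM"))

dataset = load_dataset("EleutherAI/quirky-dataset-example")  # placeholder repo ID

def tokenize(example):
    # Assumed field names; train the model to imitate both Alice's and Bob's answers.
    text = example["statement"] + " " + example["answer"]
    return tokenizer(text, truncation=True, max_length=512)

train_data = dataset["train"].map(tokenize, remove_columns=dataset["train"].column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="quirky-lora", per_device_train_batch_size=8),
    train_dataset=train_data,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
model.save_pretrained("quirky-lora")
```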
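Conceptually, the probing pipeline extracts hidden states from a quirky model, trains a probe on one distribution (e.g. truthful Alice contexts or easy questions), and tests it on another (Bob contexts or hard questions). The sketch below illustrates that idea with a simple logistic-regression probe and toy prompts; the actual scripts implement several reporter types and their own data handling:

```python
import torch
from sklearn.linear_model import LogisticRegression
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "EleutherAI/quirky-model-example"  # placeholder repo ID
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, output_hidden_states=True)
model.eval()

def last_token_hidden(prompt: str, layer: int = -1) -> torch.Tensor:
    """Hidden state of the final prompt token at the given layer."""
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        hidden_states = model(**inputs).hidden_states
    return hidden_states[layer][0, -1]

# Toy stand-ins for rows of a quirky dataset; real prompts and labels come
# from the released datasets.
alice_prompts = ["Alice: 2 + 2 = 4. True or False?", "Alice: 2 + 2 = 5. True or False?"]
alice_labels = [1, 0]
bob_prompts = ["Bob: 2 + 2 = 4. True or False?", "Bob: 2 + 2 = 5. True or False?"]
bob_labels = [1, 0]

X_train = torch.stack([last_token_hidden(p) for p in alice_prompts]).numpy()
X_test = torch.stack([last_token_hidden(p) for p in bob_prompts]).numpy()

# Train on truthful (Alice) contexts, test transfer to untruthful (Bob) contexts.
probe = LogisticRegression(max_iter=1000).fit(X_train, alice_labels)
print("Alice -> Bob transfer accuracy:", probe.score(X_test, bob_labels))
```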
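`run_anomaly.py` works from the saved probe outputs. The sketch below shows one standard flavor of mechanistic anomaly detection, scoring examples by their Mahalanobis distance from activations of a trusted distribution; the repo's implementation may differ in its details:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16  # feature dimension (e.g. a probe's input activations)

# Stand-ins for activation features; in practice these would be the hidden
# states and probe outputs saved by extract_hiddens.py.
trusted_feats = rng.normal(size=(500, d))          # trusted (Alice, easy) examples
test_feats = rng.normal(loc=0.5, size=(100, d))    # possibly anomalous (Bob) examples

# Fit a Gaussian on the trusted activations.
mean = trusted_feats.mean(axis=0)
cov = np.cov(trusted_feats, rowvar=False) + 1e-6 * np.eye(d)  # regularized covariance
cov_inv = np.linalg.inv(cov)

def mahalanobis_score(x: np.ndarray) -> float:
    diff = x - mean
    return float(diff @ cov_inv @ diff)

# Flag test examples whose distance exceeds a quantile of the trusted scores.
trusted_scores = np.array([mahalanobis_score(x) for x in trusted_feats])
test_scores = np.array([mahalanobis_score(x) for x in test_feats])
threshold = np.quantile(trusted_scores, 0.99)
print("Flagged as anomalous:", int((test_scores > threshold).sum()), "of", len(test_scores))
```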
arXiv: https://arxiv.org/abs/2312.01037
Cite:
@misc{mallen2023eliciting,
title={Eliciting Latent Knowledge from Quirky Language Models},
author={Alex Mallen and Nora Belrose},
year={2023},
eprint={2312.01037},
archivePrefix={arXiv},
primaryClass={cs.LG}
}