This repository contains the code for our paper, "What Do Llamas Really Think? Revealing Preference Biases in Language Model Representations."
Caveat: the manuscript is unpublished and subject to change. Our final submission will likely replace the current datasets with ones more grounded in the literature.
- Install the PyPI package:
```bash
pip install biasprobe
```
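A quick sanity check that the package is importable (this assumes the install succeeded; it is not an official verification step):

```python
import biasprobe  # should import cleanly if the install succeeded
```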
- Extract some embeddings. If you don't have a GPU with at least 24GB of VRAM, change the device mapping to the CPU (see the sketch after this code block):
```python
from biasprobe import PairwiseExtractionRunner, SimplePairPromptBuilder
import torch

# Load the LLM and the extraction runner. `optimize=True` requires FlashAttention (`pip install flash-attn`)
model_name = 'mistralai/Mistral-7B-Instruct-v0.1'
runner = PairwiseExtractionRunner.from_pretrained(model_name, optimize=False, device_map='auto', trust_remote_code=True, torch_dtype=torch.float16)
builder = SimplePairPromptBuilder(criterion='more positive')

# Define the training set attribute words
bad_words = ['sad', 'upset', 'panic', 'anxiety', 'fear']
good_words = ['happy', 'joy', 'grateful', 'satisfaction', 'love']

# Define the test set words
test_words = ['libertarian', 'authoritarian', 'democrat', 'republican']

# Run the extraction, probing hidden states at layer 15
train_exp = runner.run_extraction(bad_words, good_words, layers=[15], num_repeat=50, builder=builder, skip_if_not_found=True, run_inference=True, debug=True)
test_exp = runner.run_extraction(test_words, test_words, layers=[15], num_repeat=50, builder=builder, skip_if_not_found=True, run_inference=True, debug=True)
```
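For a CPU-only setup, a minimal variant of the loading step would look like the following. This is a sketch under the assumption that `from_pretrained` forwards `device_map` and `torch_dtype` to Hugging Face Transformers; we also switch to float32, since float16 kernels are poorly supported on CPU:

```python
# Assumption: `device_map` and `torch_dtype` are forwarded to `transformers`.
# float16 is spotty on CPU, so fall back to float32 (slower, uses more RAM).
runner = PairwiseExtractionRunner.from_pretrained(
    model_name,
    optimize=False,
    device_map='cpu',
    trust_remote_code=True,
    torch_dtype=torch.float32,
)
```

On CPU, also drop the `.cuda()` call when constructing the `ProbeTrainer` below.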
- Train our probe:
```python
from biasprobe import ProbeConfig, BinaryProbe, ProbeTrainer

# Build datasets from the layer-15 hidden states extracted above
train_ds = train_exp.make_dataset(15, label_type='predicted')
test_ds = test_exp.make_dataset(15)

# Initialize the probe for Mistral-7B and fit it on the training pairs
config = ProbeConfig.create_for_model('mistralai/Mistral-7B-Instruct-v0.1')
probe = BinaryProbe(config)
trainer = ProbeTrainer(probe.cuda())
trainer.fit(train_ds)

# Predict pairwise preferences on the test set
_, preferred_pairs = trainer.predict(test_ds)
```
`preferred_pairs` contains a list of pairs, where the first item is preferred over the second. Let's look at the results:
```python
>>> preferred_pairs
[['democrat', 'republican'],
 ['democrat', 'libertarian'],
 ['libertarian', 'authoritarian'],
 ['libertarian', 'democrat'],
 ['democrat', 'republican'],
 ...
```
This shows a bias toward associating 'democrat' and 'libertarian' with more positive emotions than 'authoritarian' and 'republican'.
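Since `num_repeat=50` produces many repeated comparisons, you may want to aggregate them into a single ranking. The helper below is not part of the `biasprobe` API; it's a hypothetical post-processing sketch that ranks words by net wins (times preferred minus times dispreferred):

```python
from collections import Counter

def rank_by_net_wins(pairs):
    """Rank words by (#times preferred) - (#times dispreferred)."""
    wins = Counter(winner for winner, _ in pairs)
    losses = Counter(loser for _, loser in pairs)
    words = wins.keys() | losses.keys()
    return sorted(words, key=lambda w: wins[w] - losses[w], reverse=True)

print(rank_by_net_wins(preferred_pairs))
```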
If you use this code, please cite our paper:

```bibtex
@article{tang2023found,
  title={What Do Llamas Really Think? Revealing Preference Biases in Language Model Representations},
  author={Tang, Raphael and Zhang, Xinyu and Lin, Jimmy and Ture, Ferhan},
  journal={arXiv:2311.18812},
  year={2023}
}
```