Stephen Casper scasper@mit.edu
Jason Lin
Joe Kwon
Gatlen Culp
Dylan Hadfield-Menell
Read the paper on arXiv: Explore, Establish, Exploit: Red Teaming Language Models from Scratch.
Check out the CommonClaim dataset.
@misc{casper2023explore,
title={Explore, Establish, Exploit: Red Teaming Language Models from Scratch},
author={Stephen Casper and Jason Lin and Joe Kwon and Gatlen Culp and Dylan Hadfield-Menell},
year={2023},
eprint={2306.09442},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
This repository contains a modified version of the trlx library, commit 18ffb1ae09 from January 3, 2023.
All code has been tested with Python 3.10.
git clone https://github.com/thestephencasper/explore_establish_exploit_llms.git
cd explore_establish_exploit_llms
pip install -r requirements.txt
cd trlx
pip install -e .
cd ..
mkdir models
mkdir data
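The directory setup above can also be done from Python so that it is safe to rerun. The following is a minimal sketch, not part of this repository: the names prepare_workspace and python_matches_tested are hypothetical, and it assumes the scripts expect models/ and data/ directly under the repository root.

```python
import sys
from pathlib import Path

# Hypothetical helpers for illustration; they are not part of this repository.
TESTED_PYTHON = (3, 10)  # the version this README says the code was tested with


def python_matches_tested(version_info=sys.version_info, tested=TESTED_PYTHON):
    """Return True if the interpreter's major.minor equals the tested version."""
    return tuple(version_info[:2]) == tuple(tested)


def prepare_workspace(root="."):
    """Create models/ and data/ under root and return their paths."""
    created = []
    for name in ("models", "data"):
        path = Path(root) / name
        # Unlike a plain mkdir, this does not fail if the directory already exists.
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
    return created
```

Calling prepare_workspace() from the repository root has the same effect as the two mkdir commands, and python_matches_tested() can be used to warn when running on an untested interpreter.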
This repository contains all resources needed to red team the open-source GPT-2-xl in order to elicit toxic outputs.
The 4 E's (explore, establish, exploit, evaluate), run in order:
python explore.py
python establish.py
python exploit.py
python evaluate.py
Then check the results, which will be summarized in /data/results.txt.
Be warned that the final results will be offensive in nature.
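The four scripts above can be chained so that a failure in one stage stops the later ones. This is a minimal sketch rather than a script shipped with the repository: run_pipeline is a hypothetical name, and it assumes the scripts are run from the repository root, take no required arguments (as in the commands above), and signal failure with a nonzero exit code.

```python
import subprocess
import sys

# The four stages, in the order this README lists them.
STAGES = ["explore.py", "establish.py", "exploit.py", "evaluate.py"]


def run_pipeline(stages=STAGES, run=subprocess.run):
    """Run each stage in order and stop at the first one that fails.

    Returns the list of stages that completed successfully. The `run`
    argument is injectable so the logic can be tried without the real scripts.
    """
    completed = []
    for script in stages:
        # Use the current interpreter so the active environment is reused.
        result = run([sys.executable, script])
        if result.returncode != 0:
            print(f"{script} exited with code {result.returncode}; stopping.")
            break
        completed.append(script)
    return completed
```

Calling run_pipeline() from the repository root runs all four stages; passing a shorter list, such as run_pipeline(["exploit.py", "evaluate.py"]), reruns only the later stages.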