This is the official repository for "Robust Prompt Optimization for Defending Language Models Against Jailbreaking Attacks" by Andy Zhou, Bo Li, and Haohan Wang. Website with more examples can be found here.
We include a notebook demo.ipynb
containing the minimal implementation of RPO, for defending Llama-2 against the popular AIM attack.
This can also be found on Colab
This repository is based on the GCG repository (https://github.com/llm-attacks/llm-attacks)
The rpo
package can be installed by running the following command at the root of this repository:
pip install livelossplot
pip install -e .
Please follow the instructions to download LLaMA-2-7B-Chat first (we use the weights converted by HuggingFace here). To modify the paths to your models and tokenizers, please add the following lines in experiments/configs/transfer_xxx.py
.
config.model_paths = [
"/DIR/vicuna/vicuna-7b-v1.3",
... # more models
]
config.tokenizer_paths = [
"/DIR/vicuna/vicuna-7b-v1.3",
... # more tokenizers
]
Jailbreaks can be added as text files to the experiments/jailbreaks
folder. The experiments
folder contains code to reproduce RPO experiments on AdvBench. Modify the script with the correct path and run the following code inside experiments
:
cd launch_scripts
bash run_gcg_multiple.sh vicuna # or llama2
Here are some example strings we optimized. These are effective across language models and jailbreaks. It is fair to directly use these strings for baseline comparisons as they are expected to transfer to other models and attacks.
Starling-7B
:}</'])language seems scary-----------}] estimation >>begin &=&pgf degli ({\ $|\textwidthAI NOTE:
Llama2-7B-Chat
was Bel Let]\"> It\'\ si distinguish \({\footnote{[:// Inst bitimidenote]{BWebachiv
We provide a simple script in experiments/test.py
to evaluate the robustness of generated suffixes on jailbreaks. Code will be updated with JailbreakBench evaluation later.
If you found our paper or repo interesting, thanks! Please consider citing the following
@misc{zhou2024robust,
title={Robust Prompt Optimization for Defending Language Models Against Jailbreaking Attacks},
author={Andy Zhou and Bo Li and Haohan Wang},
year={2024},
eprint={2401.17263},
archivePrefix={arXiv},
primaryClass={cs.LG}
}
rpo
is licensed under the terms of the MIT license. See LICENSE for more details.