This is the official repository for "Catastrophic Jailbreak of Open-source LLMs via Exploiting Generation", by Yangsibo Huang, Samyak Gupta, Mengzhou Xia, Kai Li, and Danqi Chen.
In this work, we propose the generation exploitation attack, an extremely simple approach that disrupts model alignment by only manipulating variations of generation methods. Check out our project page for more details and examples!
If you find our implementation and paper helpful, please consider citing our work:
@article{huang2023catastrophic,
title={Catastrophic Jailbreak of Open-source LLMs via Exploiting Generation},
author={Huang, Yangsibo and Gupta, Samyak and Xia, Mengzhou and Li, Kai and Chen, Danqi},
journal={arXiv preprint arXiv:2310.06987},
year={2023}
}
- 2023-10-11: We release the first version.
Our implementation was based on transformers==4.32.1
and torch==2.0.1
.
Our attack is model-agnostic. To launch the attack on a specified model, please download the model and insert/update its path in configs.py
. The table below provides links to all models we used in the paper.
Model | Hugging Face Page |
---|---|
Vicuna-7B | https://huggingface.co/lmsys/vicuna-7b-v1.5 |
Vicuna-13B | https://huggingface.co/lmsys/vicuna-13b-v1.5 |
Vicuna-33B | https://huggingface.co/lmsys/vicuna-33b-v1.3 |
MPT-7B | https://huggingface.co/mosaicml/mpt-7b |
MPT-30B | https://huggingface.co/mosaicml/mpt-30b |
Falcon-7B | https://huggingface.co/tiiuae/falcon-7b |
Falcon-40B | https://huggingface.co/tiiuae/falcon-40b |
LLaMA2-7B | https://huggingface.co/meta-llama/Llama-2-7b-hf |
LLaMA2-13B | https://huggingface.co/meta-llama/Llama-2-13b-hf |
LLaMA2-7B-chat | https://huggingface.co/meta-llama/Llama-2-7b-chat-hf |
LLaMA2-13B-chat | https://huggingface.co/meta-llama/Llama-2-13b-chat-hf |
The commands below runs the default decoding (temperature=0.1
, top_p=0.9
) for the Llama-2-7b-chat-hf
model:
python attack.py \
--model Llama-2-7b-chat-hf \
--use_default \
--use_system_prompt
or the greedy decoding for the Llama-2-7b-chat-hf
model:
python attack.py \
--model Llama-2-7b-chat-hf \
--use_greedy \
--use_system_prompt
The command below exploits various decoding settings for the Llama-2-7b-chat-hf
model (with the system prompt disabled):
python attack.py \
--model Llama-2-7b-chat-hf \
--tune_temp \
--tune_topp \
--tune_topk \
--n_sample 1
You can increase the --n_sample
parameter to generate more examples for each prompt, which potentially makes the attack stronger.
Before running the evaluation, please make sure that you have prepared a risk scorer model (saved in SCORER_PATH
) for the attacker to pick the best attack result among several candidates. In the meantime, you need to prepare a separate evaluator (saved in EVALUATOR_PATH
) for the final scoring purpose.
You may download our scorer and evaluator here.
To evaluate the attack with substring match by Zou et al., you can use the following command:
python evaluate.py \
--model Llama-2-7b-chat-hf \
--config 'exploited' \
--matching_only \
--scorer_path SCORER_PATH
or using our evaluator:
python evaluate.py \
--model Llama-2-7b-chat-hf \
--config 'exploited' \
--evaluator_path EVALUATOR_PATH \
--scorer_path SCORER_PATH
which summarizes results in a json file. For instance, here is the output for Llama-2-7b-chat-hf
model with n_sample=1
.
{
"best_attack_config": {
"temp": {
"temp_0.95": "25" # Most vulnerable temp is 0.95, which gives 25% ASR
},
"topk": {
"topk_500": "26" # Most vulnerable top-k is 500, which gives 26% ASR
},
"topp": {
"topp_0.7": "29" # Most vulnerable temp is 0.7, which gives 29% ASR
}
},
"greedy": "16", # Greedy decoding without system prompt gives 16% ASR
"break_by_temp": "47", # Exploiting temp only gives 47% ASR
"break_by_topk": "54", # Exploiting top-k only gives 54% ASR
"break_by_topp": "77", # Exploiting top-p only gives 77% ASR
"break_by_all": "81" # Exploiting all decoding strategies gives 81% ASR
}
You can aggregate results for different models (to obtain Table 1 in our paper) using the example shown in aggregate_results.ipynb
.
If you have any questions related to the code or the paper, feel free to email Yangsibo (yangsibo@princeton.edu) or open an issue. Please try to specify the problem with details so we can help you better and quicker!