We study the controllable jailbreaks on large language models (LLMs). Specifically, we focus on how to enforce control on LLM attacks. In this work, we formally formulate the controllable attack generation problem, and build a novel connection between this problem and controllable text generation, a well-explored topic of natural language processing. Based on this connection, we adapt the Energy-based Constrained Decoding with Langevin Dynamics (COLD), a state-of-the-art, highly efficient algorithm in controllable text generation, and introduce the COLD-Attack framework which unifies and automates the search of adversarial LLM attacks under a variety of control requirements such as fluency, stealthiness, sentiment, and left-right-coherence. The controllability enabled by COLD-Attack leads to diverse new jailbreak scenarios including:
- Fluent suffix attacks (standard attack setting which append the adversarial prompt to the original malicious user query).
- Paraphrase attack with and without sentiment steering (revising a user query adversarially with minimal paraphrasing).
- Attack with left-right-coherence (inserting stealthy attacks in context with left-right-coherence).
More details can be found in our paper: Xingang Guo*, Fangxu Yu*, Huan Zhang, Lianhui Qin, Bin Hu, "COLD-Attack: Jailbreaking LLMs with Stealthiness and Controllability" (* Equal contribution)
As illustrated in the above diagram, our COLD-Attack framework includes three main steps:
- Energy function formulation: specify energy functions properly to capture the attack constraints such as fluency, stealthiness, sentiment, and left-right-coherence.
-
Langevin dynamics sampling: run Langevin dynamics recursively for
$N$ steps to obtain a good energy-based model governing the adversarial attack logits$\tilde{\mathbf{y}}^N$ . -
Decoding process: leverage an LLM-guided decoding process to covert the continuous logit
$\tilde{\mathbf{y}}^N$ into discrete text attacks$\mathbf{y}$ .
Here are some examples that generated by COLD-Attack:
We evaluate the performance of COLD-Attack on four popular white-box LLMs: Vicuna-7b-v1.5 (Vicuna), Llama-2-7b-Chat-hf (Llama2), Guanaco-7b (Guanaco), and Mistral-7b-Instruct-v0.2 (Mistral). In addition, we use the following three main evaluation metrics:
- Attack Successful Rate (ASR): the percentage of instructions that elicit corresponding harmful outputs using sub-string matching method.
- GPT-4 based ASR (ASR-G): We develop a prompt template and utilize GPT-4 to assess whether a response accurately fulfills the malicious instruction. Based on our observations, ASR-G has shown higher correlation with human annotations, providing a more reliable measure of attack effectiveness.
- Perplexity (PPL): We use PPL to evaluate the fluency of the generated prompts and use Vicuna-7b to do the PPL calculation.
To ensure the generated adversarial prompts meet specific criteria, we apply controls over various features, including sentiment and vocabulary. We evaluate how well these controls work using a metric called Succ, which represents the percentage of samples that successfully adhere to our set requirements. Additionally, a range of NLP-related evaluation metrics including BERTScore, BLEU, and ROUGE are applied to evaluate the quality of generated controllable attacks.
Models | ASR ↑ | ASR-G ↑ | PPL ↓ |
---|---|---|---|
Vicuna | 100.0 | 86.00 | 32.96 |
Guanaco | 96.00 | 84.00 | 30.55 |
Mistral | 92.00 | 90.00 | 26.24 |
Llama2 | 92.00 | 66.00 | 24.83 |
Models | ASR ↑ | ASR-G ↑ | PPL ↓ | BLEU ↑ | ROUGE ↑ | BERTScore ↑ |
---|---|---|---|---|---|---|
Vicuna | 96.00 | 80.00 | 31.11 | 0.52 | 0.57 | 0.72 |
Guanaco | 98.00 | 78.00 | 29.23 | 0.47 | 0.55 | 0.74 |
Mistral | 98.00 | 90.00 | 37.21 | 0.41 | 0.55 | 0.72 |
Llama2 | 86.00 | 74.00 | 39.26 | 0.60 | 0.54 | 0.71 |
Models | ASR ↑ | ASR-G ↑ | Succ ↑ | PPL ↓ |
---|---|---|---|---|
Sentiment Constraint | ||||
Vicuna | 90.00 | 96.00 | 84.00 | 66.48 |
Guanaco | 96.00 | 94.00 | 82.00 | 74.05 |
Mistral | 92.00 | 96.00 | 92.00 | 67.61 |
Llama2 | 80.00 | 88.00 | 64.00 | 59.53 |
Lexical Constraint | ||||
Vicuna | 92.00 | 100.00 | 82.00 | 76.69 |
Guanaco | 92.00 | 96.00 | 82.00 | 99.03 |
Mistral | 94.00 | 84.00 | 92.00 | 96.06 |
Llama2 | 88.00 | 86.00 | 68.00 | 68.23 |
Format Constraint | ||||
Vicuna | 92.00 | 94.00 | 88.00 | 67.63 |
Guanaco | 92.00 | 94.00 | 72.00 | 72.97 |
Mistral | 94.00 | 86.00 | 84.00 | 44.56 |
Llama2 | 80.00 | 86.00 | 72.00 | 57.70 |
Style Constraint | ||||
Vicuna | 94.00 | 96.00 | 80.00 | 81.54 |
Guanaco | 94.00 | 92.00 | 70.00 | 75.25 |
Mistral | 92.00 | 90.00 | 86.00 | 54.50 |
Llama2 | 80.00 | 80.00 | 68.00 | 58.93 |
Please see more detaieled evaluation results and discussions in our paper.
1) Download this GitHub
git clone https://github.com/Yu-Fangxu/COLD-Attack.git
2) Setup Environment
We recommend conda for setting up a reproducible experiment environment.
We include environment.yaml
for creating a working environment:
conda env create -f environment.yaml -n cold-attack
You will then need to setup NLTK and hugging face:
conda activate cold-attack
python3 -c "import nltk; nltk.download('stopwords', 'averaged_perceptron_tagger', 'punkt'); "
To run the Llama-2 model, you will need to request access at Hugging face and setup account login:
huggingface-cli login --token [Your Hugging face token]
3) Run Command for COLD-Attack
- Fluent suffix attack
bash attack.sh "suffix"
- Paraphrase attack
bash attack.sh "paraphrase"
- Left-right-coherence control
bash attack.sh "control"
If you find our repository helpful to your research, please consider citing:
@article{guo2024cold,
title={Cold-attack: Jailbreaking llms with stealthiness and controllability},
author={Guo, Xingang and Yu, Fangxu and Zhang, Huan and Qin, Lianhui and Hu, Bin},
journal={arXiv preprint arXiv:2402.08679},
year={2024}
}
@article{qin2022cold,
title={Cold decoding: Energy-based constrained text generation with langevin dynamics},
author={Qin, Lianhui and Welleck, Sean and Khashabi, Daniel and Choi, Yejin},
journal={Advances in Neural Information Processing Systems},
volume={35},
pages={9538--9551},
year={2022}
}