COLD-Attack: Jailbreaking LLMs with Stealthiness and Controllability

We study the controllable jailbreaks on large language models (LLMs). Specifically, we focus on how to enforce control on LLM attacks. In this work, we formally formulate the controllable attack generation problem, and build a novel connection between this problem and controllable text generation, a well-explored topic of natural language processing. Based on this connection, we adapt the Energy-based Constrained Decoding with Langevin Dynamics (COLD), a state-of-the-art, highly efficient algorithm in controllable text generation, and introduce the COLD-Attack framework which unifies and automates the search of adversarial LLM attacks under a variety of control requirements such as fluency, stealthiness, sentiment, and left-right-coherence. The controllability enabled by COLD-Attack leads to diverse new jailbreak scenarios including:

Fluent suffix attacks (standard attack setting which append the adversarial prompt to the original malicious user query).
Paraphrase attack with and without sentiment steering (revising a user query adversarially with minimal paraphrasing).
Attack with left-right-coherence (inserting stealthy attacks in context with left-right-coherence).

More details can be found in our paper: Xingang Guo*, Fangxu Yu*, Huan Zhang, Lianhui Qin, Bin Hu, "COLD-Attack: Jailbreaking LLMs with Stealthiness and Controllability" (* Equal contribution)

COLD-Attack

As illustrated in the above diagram, our COLD-Attack framework includes three main steps:

Energy function formulation: specify energy functions properly to capture the attack constraints such as fluency, stealthiness, sentiment, and left-right-coherence.
Langevin dynamics sampling: run Langevin dynamics recursively for $N$ steps to obtain a good energy-based model governing the adversarial attack logits $\tilde{\mathbf{y}}^N$.
Decoding process: leverage an LLM-guided decoding process to covert the continuous logit $\tilde{\mathbf{y}}^N$ into discrete text attacks $\mathbf{y}$.

Selected Examples

Here are some examples that generated by COLD-Attack:

Jailbreaking Performance

We evaluate the performance of COLD-Attack on four popular white-box LLMs: Vicuna-7b-v1.5 (Vicuna), Llama-2-7b-Chat-hf (Llama2), Guanaco-7b (Guanaco), and Mistral-7b-Instruct-v0.2 (Mistral). In addition, we use the following three main evaluation metrics:

Attack Successful Rate (ASR): the percentage of instructions that elicit corresponding harmful outputs using sub-string matching method.
GPT-4 based ASR (ASR-G): We develop a prompt template and utilize GPT-4 to assess whether a response accurately fulfills the malicious instruction. Based on our observations, ASR-G has shown higher correlation with human annotations, providing a more reliable measure of attack effectiveness.
Perplexity (PPL): We use PPL to evaluate the fluency of the generated prompts and use Vicuna-7b to do the PPL calculation.

To ensure the generated adversarial prompts meet specific criteria, we apply controls over various features, including sentiment and vocabulary. We evaluate how well these controls work using a metric called Succ, which represents the percentage of samples that successfully adhere to our set requirements. Additionally, a range of NLP-related evaluation metrics including BERTScore, BLEU, and ROUGE are applied to evaluate the quality of generated controllable attacks.

Fluent suffix attack

Models	ASR ↑	ASR-G ↑	PPL ↓
Vicuna	100.0	86.00	32.96
Guanaco	96.00	84.00	30.55
Mistral	92.00	90.00	26.24
Llama2	92.00	66.00	24.83

Paraphrase attack

Models	ASR ↑	ASR-G ↑	PPL ↓	BLEU ↑	ROUGE ↑	BERTScore ↑
Vicuna	96.00	80.00	31.11	0.52	0.57	0.72
Guanaco	98.00	78.00	29.23	0.47	0.55	0.74
Mistral	98.00	90.00	37.21	0.41	0.55	0.72
Llama2	86.00	74.00	39.26	0.60	0.54	0.71

Left-right-coherence control

Models	ASR ↑	ASR-G ↑	Succ ↑	PPL ↓
Sentiment Constraint
Vicuna	90.00	96.00	84.00	66.48
Guanaco	96.00	94.00	82.00	74.05
Mistral	92.00	96.00	92.00	67.61
Llama2	80.00	88.00	64.00	59.53
Lexical Constraint
Vicuna	92.00	100.00	82.00	76.69
Guanaco	92.00	96.00	82.00	99.03
Mistral	94.00	84.00	92.00	96.06
Llama2	88.00	86.00	68.00	68.23
Format Constraint
Vicuna	92.00	94.00	88.00	67.63
Guanaco	92.00	94.00	72.00	72.97
Mistral	94.00	86.00	84.00	44.56
Llama2	80.00	86.00	72.00	57.70
Style Constraint
Vicuna	94.00	96.00	80.00	81.54
Guanaco	94.00	92.00	70.00	75.25
Mistral	92.00	90.00	86.00	54.50
Llama2	80.00	80.00	68.00	58.93

Please see more detaieled evaluation results and discussions in our paper.

Code

1) Download this GitHub

git clone https://github.com/Yu-Fangxu/COLD-Attack.git

2) Setup Environment

We recommend conda for setting up a reproducible experiment environment. We include environment.yaml for creating a working environment:

conda env create -f environment.yaml -n cold-attack

You will then need to setup NLTK and hugging face:

conda activate cold-attack
python3 -c "import nltk; nltk.download('stopwords', 'averaged_perceptron_tagger', 'punkt'); "

To run the Llama-2 model, you will need to request access at Hugging face and setup account login:

huggingface-cli login --token [Your Hugging face token]

3) Run Command for COLD-Attack

Fluent suffix attack

bash attack.sh "suffix"

Paraphrase attack

bash attack.sh "paraphrase"

Left-right-coherence control

bash attack.sh "control"

If you find our repository helpful to your research, please consider citing:

@article{guo2024cold,
  title={Cold-attack: Jailbreaking llms with stealthiness and controllability},
  author={Guo, Xingang and Yu, Fangxu and Zhang, Huan and Qin, Lianhui and Hu, Bin},
  journal={arXiv preprint arXiv:2402.08679},
  year={2024}
}

@article{qin2022cold,
  title={Cold decoding: Energy-based constrained text generation with langevin dynamics},
  author={Qin, Lianhui and Welleck, Sean and Khashabi, Daniel and Choi, Yejin},
  journal={Advances in Neural Information Processing Systems},
  volume={35},
  pages={9538--9551},
  year={2022}
}