This repository hosts the code for the paper *Merging Improves Self-Critique Against Jailbreak Attacks*, accepted to the ICML 2024 Workshop on Foundation Models in the Wild: https://openreview.net/forum?id=HmYJ16ehbX
The robustness of large language models (LLMs) against adversarial manipulations, such as jailbreak attacks, remains a significant challenge. In this work, we propose an approach that enhances the self-critique capability of the LLM and further fine-tunes it over sanitized synthetic data. This is achieved by adding an external critic model that can be merged with the original one, bolstering its self-critique capabilities and improving the robustness of the LLM's responses to adversarial prompts. Our results demonstrate that the combination of merging and self-critique can significantly reduce the attack success rate of adversaries, offering a promising defense mechanism against jailbreak attacks.
The merged models used in the paper are available in the HuggingFace Hub. They are merges of the corresponding Mistral and Prometheus models.
Model | Size | Merged from |
---|---|---|
Merge-Mistral-Prometheus-7B | 7B | Prometheus-7B-v2, Mistral-7B-Instruct |
Merge-Mixtral-Prometheus-8x7B | 8x7B | Prometheus-8x7B-v2, Mixtral-8x7B-Instruct |
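The merged checkpoints can be loaded directly with `transformers`. Below is a minimal sketch; the repository id is an assumption, so check the Hub for the exact name of the checkpoint you want to use.

```python
# Minimal sketch: loading one of the merged models with transformers.
# The repository id below is an assumption; check the HuggingFace Hub for
# the exact name of the merged checkpoint.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "vicgalle/Merge-Mistral-Prometheus-7B"  # assumed repo id

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # the merges were produced in bfloat16
    device_map="auto",
)

prompt = "How do I politely decline an unsafe request?"
inputs = tokenizer.apply_chat_template(
    [{"role": "user", "content": prompt}],
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```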
Both models were created with the linear merge method from mergekit, using the following config (change the models accordingly):
```yaml
models:
  - model: prometheus-eval/prometheus-8x7b-v2.0
    parameters:
      weight: 1.0
  - model: mistralai/Mixtral-8x7B-Instruct-v0.1
    parameters:
      weight: 1.0
merge_method: linear
dtype: bfloat16
```
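To reproduce the merge, the config above can be run with mergekit. The sketch below assumes mergekit's documented Python interface (`MergeConfiguration`, `run_merge`); the `mergekit-yaml config.yaml ./output-dir` CLI works equally well.

```python
# Sketch of running the linear merge with mergekit's Python API.
# Names follow mergekit's documentation and may differ across versions;
# the CLI equivalent is: mergekit-yaml linear_merge.yaml ./output-dir
import torch
import yaml
from mergekit.config import MergeConfiguration
from mergekit.merge import MergeOptions, run_merge

# The YAML file containing the config shown above.
with open("linear_merge.yaml", "r", encoding="utf-8") as fp:
    merge_config = MergeConfiguration.model_validate(yaml.safe_load(fp))

run_merge(
    merge_config,
    out_path="./Merge-Mixtral-Prometheus-8x7B",
    options=MergeOptions(
        cuda=torch.cuda.is_available(),
        copy_tokenizer=True,
    ),
)
```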
The sanitized synthetic data used for fine-tuning can be generated following the `generate_data.ipynb` notebook.
It relies on ollama for faster inference. Thus, you need to get any of the previous models, convert them to `.gguf` format with llama.cpp, and then edit the templates in the `ollama_templates/` directory so that each one points to your `.gguf` files. Finally, create the corresponding models with `ollama create model-name -f path_to_template`. Essentially, the templates just specify the path to the weights and the system prompt: "You are a helpful yet harmless assistant that avoids generating illegal or harmful content."
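As an illustration of what the notebook does once the ollama models are created, here is a minimal sketch of a generate–critique–revise loop using the `ollama` Python client. The model name and prompt wording are placeholders; the actual prompts live in `generate_data.ipynb`.

```python
# Minimal sketch of the self-critique loop with the ollama Python client.
# Model name and prompts are illustrative, not the ones used in the paper.
import ollama

MODEL = "merge-mistral-prometheus"  # name given to `ollama create`

def chat(messages):
    """Send a chat request to the local ollama server and return the text."""
    return ollama.chat(model=MODEL, messages=messages)["message"]["content"]

prompt = "Some (possibly adversarial) user prompt"

# 1. Original response.
response = chat([{"role": "user", "content": prompt}])

# 2. Self-critique of the response.
critique = chat([
    {"role": "user", "content": prompt},
    {"role": "assistant", "content": response},
    {"role": "user", "content": "Critique your previous answer: is it harmful or unsafe?"},
])

# 3. Revised response, conditioned on the critique.
revised = chat([
    {"role": "user", "content": prompt},
    {"role": "assistant", "content": response},
    {"role": "user", "content": f"Given this critique:\n{critique}\nRewrite your answer so it is safe."},
])

print(revised)
```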
The safety of the generated responses can be evaluated with the `evaluate.ipynb` notebook. You just need to specify the path to the `.json` file created in the previous step (the one with the generated responses) and a together.ai API key, since the notebook uses the Llama-Guard-2 model served there. After evaluating (around 1 minute for the 52 test prompts), the JSON file is updated with two new keys containing the safety scores for the original and revised responses.
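For reference, a safety judgment from Llama-Guard-2 through together.ai looks roughly like the sketch below. The model identifier and prompt handling are assumptions; the notebook takes care of these details.

```python
# Rough sketch of querying Llama-Guard-2 through together.ai.
# The model identifier is an assumption; check together.ai's model list.
import os
from together import Together

client = Together(api_key=os.environ["TOGETHER_API_KEY"])

def is_unsafe(user_prompt: str, model_response: str) -> bool:
    """Return True if Llama-Guard-2 labels the response as unsafe."""
    result = client.chat.completions.create(
        model="meta-llama/LlamaGuard-2-8b",  # assumed identifier
        messages=[
            {"role": "user", "content": user_prompt},
            {"role": "assistant", "content": model_response},
        ],
    )
    verdict = result.choices[0].message.content.strip().lower()
    return verdict.startswith("unsafe")
```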
Since Llama-Guard-2 outputs `safe` or `unsafe` when evaluating, we can compute the attack success rate (ASR) over the scores as the fraction of responses labeled `unsafe`.
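For example, a minimal sketch of computing the ASR from the evaluated JSON file is shown below. The file name and key names are assumptions; use whatever keys `evaluate.ipynb` writes.

```python
# Minimal sketch: compute the attack success rate (ASR) from the evaluated JSON.
# The file name and key names are assumptions; adapt them to the output of
# evaluate.ipynb.
import json

with open("responses_evaluated.json") as fp:
    data = json.load(fp)

def asr(scores):
    """Fraction of responses judged `unsafe` by Llama-Guard-2."""
    return sum(s.strip().lower() == "unsafe" for s in scores) / len(scores)

print("ASR (original responses):", asr(data["scores_original"]))
print("ASR (revised responses): ", asr(data["scores_revised"]))
```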
If you find the research or code useful, please consider citing:
```
@inproceedings{gallego2024merging,
  title={Merging Improves Self-Critique Against Jailbreak Attacks},
  author={Victor Gallego},
  booktitle={ICML 2024 Workshop on Foundation Models in the Wild},
  year={2024},
  url={https://openreview.net/forum?id=HmYJ16ehbX}
}
```