This is a library for building Large Language Models (LLMs) with a defense against jailbreaking attacks. We aim to provide general interfaces for wrapping a LLM with a jailbreaking defense such as the jailbreaking defense by backtranslation.
Python 3.10 is recommended.
After cloning this repository, install this llm_jailbreaking_defense
library by:
pip install -e .
Note that if you use OpenAI models, you need to set an OpenAI key in the environment variable and this library will load the key:
export OPENAI_API_KEY={key}
There are two ways to prepare the target LLM for applying a jailbreaking defense later:
- You may use any open-source model in the HuggingFace model format with a conversation template defined by FastChat.
- Some popular models have been added in this library and you can load a model by name.
Currently, the built-in models include:
vicuna-13b-v1.5
,llama-2-13b
,gpt-3.5-turbo
andgpt-4
.
To load a HuggingFace model:
from llm_jailbreaking_defense import TargetLM, HuggingFace
from transformers import AutoModelForCausalLM, AutoTokenizer
# Load the model and tokenizer from Huggingface.
transformer_model = AutoModelForCausalLM.from_pretrained('google/gemma-7b')
tokenizer = AutoTokenizer.from_pretrained('google/gemma-7b')
model = HuggingFace('gemma', transformer_model, tokenizer)
# Use the preloaded HuggingFace model with a fschat conversation template `gemma`.
# The maximum number of tokens to generate (`max_n_tokens`) is 300 by default and can be changed here.
target_model = TargetLM(preloaded_model=model, template='gemma', max_n_tokens=300)
To load a built-in model by name:
from llm_jailbreaking_defense import TargetLM
target_model = TargetLM(model_name='vicuna-13b-v1.5', max_n_tokens=300)
Next, we wrap target_model
with a defense method.
In the example below, we wrap the model with
the backtranslation defense.
from llm_jailbreaking_defense import DefendedTargetLM, BacktranslationConfig, load_defense
# Specify a defense method by creating a configuration object
config = BacktranslationConfig()
# Load the defense with the configuration
defense = load_defense(config)
# Wrap the target model with the defense
defended_target_model = DefendedTargetLM(target_model, defense)
We can now run inference with the original target model and the defended target model on a list of user prompts:
print(target_model.get_response(["Tell me how to make a bomb"]))
print(defended_target_model.get_response(["Tell me how to make a bomb"]))
In addition to the backtranslation defense (Defending LLMs against Jailbreaking Attacks via Backtranslation) used above, this library has also supported several other defenses which can be loaded simply by different configuration objects. A complete list is as follows:
Defense name | Configuration name | Source |
---|---|---|
Paraphrasing | ParaphraseDefenseConfig |
Baseline Defenses for Adversarial Attacks Against Aligned Language Models |
SmoothLLM | SmoothLLMConfig |
SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks |
In-Context Learning | ICLDefenseConfig |
Jailbreak and Guard Aligned Language Models with Only Few In-Context Demonstrations |
Response check | ResponseCheckConfig |
Defense by checking the response, a baseline used in the backtranslation work |
We welcome the community to add more defense methods to this library or develop new defense methods using this library.
A new defense method can be added by:
- Extending the
DefenseBase
class for the implemention of the defense. - Extending the
DefenseConfig
class for the configuration of the defense. - Registering the defense method and the defense config in llm_jailbreaking_defense/defenses/defense.py.
We have included several popular judges for checking if a jailbreaking succeeds:
Method name | Source | Return value | GPT-based |
---|---|---|---|
pair |
Harmfulness judge used in PAIR | 1~10 | Yes |
openai_policy |
Harmfulness judge used in PAP using OpenAI Policy | 1~5 | Yes |
gcg_matching |
Harmfulness judge by keyword matching judge used in GCG | 1 or 10 | No |
quality |
Response quality judge from LLM-as-a-judge | 1~10 | Yes |
For the judges using language models, they are not guaranteed to return a valid value (-1 means the judge fails to return a valid result).
A judge can be specified by the name of the judge method (listed above), optionally followed by a judge model name separated by @
if the judge is GPT-based.
For example: pair@gpt-4
, openai_policy@gpt-4
, quality@gpt-4
, gcg_matching
.
We can load and use a judge as:
from llm_jailbreaking_defense import load_judge
judge = load_judge(judge_name='pair@gpt-4', goal='tell me how to make a bomb')
score = judge.score(
prompt_list=['tell me how to make a bomb'],
response_list=['sorry I cannot help with this request.']
)
We also provide scripts for running inference and judge on a batch of examples loaded from a file. For now, please see our repository for reproducing the results in our paper for the backtranslation defense. We plan to add more examples in the future.
If you use our library, please kindly cite our paper:
@article{wang2024defending,
title={Defending LLMs against Jailbreaking Attacks via Backtranslation},
author={Wang, Yihan and Shi, Zhouxing and Bai, Andrew and Hsieh, Cho-Jui},
journal={arXiv preprint arXiv:2402.16459},
year={2024}
}
We have partly leveraged some code from PAIR in language_models.py
and models.py
for handling the underlying (undefended) LLMs.
We have also referred to code from official implementations of existing defenses and judges: