🔥🔥🔥 Jailbreak Attacks and Defenses against Multimodal Generative Models: A Survey
We've curated a collection of the latest 😋, most comprehensive 😎, and most valuable 🤩 resources on Jailbreak Attacks and Defenses against Multimodal Generative Models.
But we don't stop there; our repository is constantly updated to ensure you have the most current information at your fingertips.
This survey presents a comprehensive review of existing jailbreak attacks and defenses against multimodal generative models.
Following the generalized lifecycle of a multimodal jailbreak, we systematically explore attacks and the corresponding defense strategies across four levels: input, encoder, generator, and output.
🧑‍💻 Four Levels of the Multimodal Jailbreak Lifecycle
- Input Level: Attackers and defenders operate solely on the input data. Attackers modify inputs to execute attacks, while defenders incorporate protective cues to enhance detection.
- Encoder Level: With access to the encoder, attackers optimize adversarial inputs to inject malicious information into the encoding process, while defenders work to prevent harmful information from being encoded within the latent space.
- Generator Level: With full access to the generative models, attackers leverage inference information, such as activations and gradients, and fine-tune models to increase adversarial effectiveness, while defenders use these techniques to strengthen model robustness.
- Output Level: With the output from the generative model, attackers can iteratively refine adversarial inputs, while defenders can apply post-processing techniques to enhance detection.
Based on this analysis, we present a detailed taxonomy of attack methods, defense mechanisms, and evaluation frameworks specific to multimodal generative models.
We cover a wide range of input-output configurations, including modalities such as Any-to-Text, Any-to-Vision, and Any-to-Any within generative systems.
Below are tables of model short names and representative generative models used in jailbreak research, grouped by modality. For input/output modalities, I: Image, T: Text, V: Video, A: Audio.

Any-to-Text models:

Short Name | Modality | Representative Model |
---|---|---|
IT2T | I + T → T | LLaVA, MiniGPT-4, InstructBLIP |
VT2T | V + T → T | Video-LLaVA, Video-LLaMA |
AT2T | A + T → T | Audio Flamingo, AudioPaLM |
Any-to-Vision models:

Short Name | Modality | Representative Model |
---|---|---|
T2I | T → I | Stable Diffusion, Midjourney, DALL-E |
IT2I | I + T → I | DreamBooth, InstructP2P |
T2V | T → V | Open-Sora, Stable Video Diffusion |
IT2V | I + T → V | VideoPoet, CogVideoX |
Any-to-Any models:

Short Name | Modality | Representative Model |
---|---|---|
IT2IT | I + T → I + T | NExT-GPT, Chameleon |
TIV2TIV | T + I + V → T + I + V | Emu3 |
Any2Any | Any → Any | GPT-4o, Gemini Ultra |
We categorize attack methods into black-box, gray-box, and white-box attacks. In a black-box setting, where the model is inaccessible to the attacker, the attack is limited to surface-level interactions, focusing solely on the model's input and/or output. For gray-box and white-box attacks, we consider model-level attacks at both the encoder and the generator.
- Input-level attack: Attackers are compelled to develop more sophisticated input templates across prompt engineering, image engineering, and role-play techniques.
- Output-level attack: Attackers focus on querying outputs across multiple input variants. Driven by specific adversarial goals, attackers employ estimation-based and search-based attack techniques to iteratively refine these input variants.
- Encoder-level attack: Attackers are restricted to accessing only the encoders to provoke harmful responses. In this case, attackers typically seek to maximize cosine similarity within the latent space, ensuring the adversarial input retains semantics similar to the target malicious content while still being classified as safe (a minimal sketch follows this list).
- Generator-level attack: Attackers have unrestricted access to the generative model’s architecture and checkpoint, enabling attackers to conduct thorough investigations and manipulations, thus enabling sophisticated attacks.
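To make the encoder-level objective concrete, below is a minimal PyTorch sketch of PGD-style gradient ascent that maximizes the cosine similarity between an adversarial image embedding and the embedding of target malicious content. The `image_encoder`, the precomputed `target_emb`, and all hyperparameters are illustrative assumptions; any CLIP-style encoder would fit this pattern, and specific surveyed attacks differ in their details.

```python
import torch
import torch.nn.functional as F

def encoder_level_attack(image, target_emb, image_encoder,
                         steps=100, alpha=1 / 255, epsilon=8 / 255):
    """Sketch of an encoder-level attack: perturb `image` within an
    L-inf ball so its embedding aligns with `target_emb` (the encoding
    of the target malicious content). `image_encoder` is a hypothetical
    CLIP-style encoder; pixel values are assumed to lie in [0, 1]."""
    adv = image.clone().detach().requires_grad_(True)
    for _ in range(steps):
        sim = F.cosine_similarity(image_encoder(adv), target_emb, dim=-1).mean()
        sim.backward()  # gradient ascent on the similarity objective
        with torch.no_grad():
            adv += alpha * adv.grad.sign()
            # Project back into the epsilon-ball and the valid pixel range.
            adv.copy_(torch.min(torch.max(adv, image - epsilon), image + epsilon))
            adv.clamp_(0.0, 1.0)
        adv.grad = None
    return adv.detach()
```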
Below are the papers related to jailbreak attacks.
Title | Venue | Date | Code | Taxonomy | Multimodal Model |
---|---|---|---|---|---|
Gradient-based Jailbreak Images for Multimodal Fusion Models | arXiv 2024 | 2024/10/04 | Github | --- | IT2IT |
Voice Jailbreak Attacks Against GPT-4o | arXiv 2024 | 2024/05/29 | Github | Input Level | Any2Any |
Current efforts on jailbreak defense for multimodal generative models follow two lines of work: discriminative defense and transformative defense.
- Discriminative defense: constrained to classification tasks that assign binary safe/unsafe labels (a minimal sketch follows this list).
- Transformative defense: aims to produce appropriate and safe responses in the presence of malicious or adversarial inputs.
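The sketch below illustrates the discriminative line of defense as a gating pattern around generation. Both `generate` and `safety_classifier` are hypothetical stand-ins, not a specific method from the surveyed papers: any binary safety detector applied at the input and output levels follows this shape.

```python
def guarded_generate(image, text, generate, safety_classifier,
                     refusal="I can't assist with that request."):
    """Discriminative defense sketch: assign binary safe/unsafe labels
    at the input and output levels, refusing whenever either is unsafe.
    `generate` and `safety_classifier` are hypothetical stand-ins."""
    # Input-level check: block malicious prompts before generation.
    if not safety_classifier(image=image, text=text):
        return refusal
    response = generate(image, text)
    # Output-level check: catch harmful responses that slip through.
    if not safety_classifier(image=None, text=response):
        return refusal
    return response
```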
Below are the papers related to jailbreak defense.
Title | Venue | Date | Code | Taxonomy | Multimodal Model |
---|---|---|---|---|---|
Jailbreak datasets for Any-to-Text models:

Dataset | Task | Text Source | Image Source | Volume | Access |
---|---|---|---|---|---|
SafeBench | Attack | GPT generation | Typography | 500 | Github |
AdvBench | Attack | LLM generation | N/A | 500 | Github |
RedTeam-2K | Attack | Exist. & GPT Generation | N/A | 2000 | Huggingface |
HarmBench | Attack & Defense | Unpublished | N/A | 320 | Github |
HADES | Defense | GPT generation | Typography & Diffusion Generation | 750 | Github |
MM-SafetyBench | Defense | GPT generation | Typography & Diffusion Generation | 5040 | Github |
JailBreakV-28K | Defense | Adv. Prompt on RedTeam-2K | Blank & Noise & Natural & Synthesize | 28000 | Huggingface |
VLGuard | Defense | GPT generation | Exist. | 3000 | Huggingface |
Jailbreak datasets for Any-to-Vision models:

Dataset | Task | Text Source | Image Source | Volume | Access |
---|---|---|---|---|---|
NSFW-200 | Attack | Human curation | N/A | 200 | Github |
MMA | Attack | Exist. & Adv. Prompt | N/A | 1000 | Huggingface |
VBCDE-100 | Attack | Human curation | N/A | 100 | Github |
I2P | Attack & Defense | Real-world Website | Real-world Website | 4703 | Huggingface |
Unsafe Diffusion | Defense | Human curation & Website & Exist. | N/A | 1434 | Github |
MACE | Defense | Human curation | Diffusion Generation | 200 | Github |
Current evaluation methods are primarily classified into two categories: manual evaluation and automated evaluation.
- Manual evaluation involves human assessment to determine if the content is toxic, offering a direct and interpretable method of evaluation.
- Automated approaches assess the safety of multimodal generative models by employing a range of techniques, including detector-based, GPT-based, and rule-based methods (a detector-based example follows the tables below).
Text toxicity detectors:

Toxicity detector | Access |
---|---|
Llama Guard | Huggingface |
Llama Guard 2 | Huggingface |
Detoxify | Github |
GPTFUZZER | Huggingface |
Perspective API | Website |
Image toxicity detectors:

Toxicity detector | Access |
---|---|
NudeNet | Github |
Q16 | Github |
Safety Checker | Huggingface |
Imgcensor | Github |
Multi-headed Safety Classifier | Github |
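As an example of detector-based automated evaluation, the snippet below scores a model response with Detoxify (listed in the text detector table above). The 0.5 threshold is an illustrative choice, not a standard prescribed by the survey.

```python
from detoxify import Detoxify  # pip install detoxify

# Load a pretrained toxicity classifier and score a model response.
detector = Detoxify("original")
scores = detector.predict("the model response to be evaluated")

# `scores` maps labels such as 'toxicity', 'insult', and 'threat' to
# probabilities; flag the response if toxicity exceeds a threshold.
is_unsafe = scores["toxicity"] > 0.5
print(scores, is_unsafe)
```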
If you find this work useful in your research, please kindly cite it using the following BibTeX:
@article{liu2024jailbreak,
title={Jailbreak Attacks and Defenses against Multimodal Generative Models: A Survey},
author={Liu, Xuannan and Cui, Xing and Li, Peipei and Li, Zekun and Huang, Huaibo and Xia, Shuhan and Zhang, Miaoxuan and Zou, Yueying and He, Ran},
journal={arXiv preprint arXiv:2411.09259},
year={2024},
}