Awesome-Multimodal-Jailbreak

A Survey on Jailbreak Attacks and Defenses against Multimodal Generative Models

😈🛡️Awesome-Jailbreak-against-Multimodal-Generative-Models

🔥🔥🔥 Jailbreak Attacks and Defenses against Multimodal Generative Models: A Survey


We've curated a collection of the latest 😋, most comprehensive 😎, and most valuable 🤩 resources on jailbreak attacks and defenses against multimodal generative models.
But we don't stop there; our repository is constantly updated to ensure you have the most current information at your fingertips.

[Figure: survey overview]

🤗Introduction

This survey presents a comprehensive review of existing jailbreak attacks and defenses against multimodal generative models.
Following the generalized lifecycle of a multimodal jailbreak, we systematically explore attacks and the corresponding defense strategies across four levels: input, encoder, generator, and output.

🧑‍💻 Four Levels of the Multimodal Jailbreak Lifecycle

  • Input Level: Attackers and defenders operate solely on the input data. Attackers modify inputs to execute attacks, while defenders incorporate protective cues to enhance detection.
  • Encoder Level: With access to the encoder, attackers optimize adversarial inputs to inject malicious information into the encoding process, while defenders work to prevent harmful information from being encoded within the latent space.
  • Generator Level: With full access to the generative models, attackers leverage inference information, such as activations and gradients, and fine-tune models to increase adversarial effectiveness, while defenders use these techniques to strengthen model robustness.
  • Output Level: With the output from the generative model, attackers can iteratively refine adversarial inputs, while defenders can apply post-processing techniques to enhance detection.

Based on this analysis, we present a detailed taxonomy of attack methods, defense mechanisms, and evaluation frameworks specific to multimodal generative models.
We cover a wide range of input-output configurations, including modalities such as Any-to-Text, Any-to-Vision, and Any-to-Any within generative systems.

[Figure: survey taxonomy]

🚀Table of Contents

🔥Multimodal Generative Models

Below are tables of model short names and representative generative models used in jailbreak research. For input/output modalities, I: Image, T: Text, V: Video, A: Audio.

📑Any-to-Text Models (LLM Backbone)

Short Name Modality Representative Model
IT2T I + T → T LLaVA, MiniGPT4, InstructBLIP
VT2T V + T → T Video-LLaVA, Video-LLaMA
AT2T A + T → T Audio Flamingo, AudioPaLM

📖Any-to-Vision (Diffusion Backbone)

Short Name Modality Representative Model
T2I T → I Stable Diffusion, Midjourney, DALL·E
IT2I I + T → I DreamBooth, InstructP2P
T2V T → V Open-Sora, Stable Video Diffusion
IT2V I + T → V VideoPoet, CogVideoX

📰Any-to-Any (Unified Backbone)

Short Name Modality Representative Model
IT2IT I + T → I + T NExT-GPT, Chameleon
TIV2TIV T + I + V → T + I + V EMU3
Any2Any Any → Any GPT-4o, Gemini Ultra

😈Jailbreak Attack

📖Attack-Intro

We categorize attack methods into black-box, gray-box, and white-box attacks. In the black-box setting, where the model is inaccessible to the attacker, attacks are limited to surface-level interactions, focusing solely on the model's input and/or output. For gray-box and white-box attacks, we consider model-level attacks, covering both the encoder and the generator.

  • Input-level attack: Attackers are compelled to develop more sophisticated input templates spanning prompt engineering, image engineering, and role-play techniques.
  • Output-level attack: Attackers focus on querying outputs across multiple input variants. Driven by a specific adversarial goal, they employ estimation-based and search-based techniques to iteratively refine these input variants (a minimal sketch follows the figure below).

[Figure: jailbreak_attack_black_box]
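To make the output-level loop concrete, here is a minimal, hypothetical sketch of a search-based attack. The `query_model`, `score_harmfulness`, and `candidate_edits` arguments are illustrative placeholders for a black-box model API, an attacker-side scoring oracle, and a set of prompt mutations; none of them comes from a specific paper.

```python
import random

def search_based_attack(seed_prompt, query_model, score_harmfulness,
                        candidate_edits, n_iters=50, target_score=0.9):
    """Generic search-based output-level refinement: mutate the prompt, keep the
    variant whose output scores highest, and stop once the (assumed) goal is met."""
    best_prompt = seed_prompt
    best_score = score_harmfulness(query_model(seed_prompt))
    for _ in range(n_iters):
        mutate = random.choice(candidate_edits)        # e.g., synonym swap, token insertion
        candidate = mutate(best_prompt)
        score = score_harmfulness(query_model(candidate))
        if score > best_score:                         # keep only improving variants
            best_prompt, best_score = candidate, score
        if best_score >= target_score:                 # adversarial goal (approximately) reached
            break
    return best_prompt
```

Estimation-based attacks follow the same query loop but replace the random mutations with gradient estimates computed from repeated queries.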

  • Encoder-level attack: Attackers are restricted to accessing only the encoders to provoke harmful responses. In this case, they typically seek to maximize cosine similarity within the latent space, ensuring the adversarial input retains semantics similar to the target malicious content while still being classified as safe (see the sketch after the figure below).
  • Generator-level attack: Attackers have unrestricted access to the generative model's architecture and checkpoints, enabling thorough investigation and manipulation and thus more sophisticated attacks.

[Figure: jailbreak_attack_white_and_gray_box]
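As an illustration of the cosine-similarity objective described above, below is a minimal PyTorch sketch of an encoder-level attack. It assumes a differentiable CLIP-style image encoder `encode_image` and a precomputed target text embedding; the step count, learning rate, and L_inf budget are arbitrary placeholder values.

```python
import torch
import torch.nn.functional as F

def encoder_level_attack(image, target_text_emb, encode_image,
                         steps=200, lr=1e-2, eps=8 / 255):
    """Optimize a bounded perturbation so the image embedding aligns with a
    target (malicious) text embedding in the shared latent space, while the
    pixel-level change stays within an L_inf budget."""
    delta = torch.zeros_like(image, requires_grad=True)
    opt = torch.optim.Adam([delta], lr=lr)
    for _ in range(steps):
        adv_emb = encode_image((image + delta).clamp(0, 1))
        # Maximize cosine similarity to the target embedding (minimize its negative).
        loss = -F.cosine_similarity(adv_emb, target_text_emb, dim=-1).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
        # Project the perturbation back into the allowed L_inf ball.
        delta.data.clamp_(-eps, eps)
    return (image + delta).clamp(0, 1).detach()
```

The same objective can target a harmful image embedding instead of a text embedding; the defining trait of this level is that the optimization never touches the downstream generator.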

📑Papers

Below are the papers related to jailbreak attacks.

Jailbreak Attack of Any-to-Text Models

Title Venue Date Code Taxonomy Multimodal Model
Safe + Safe = Unsafe? Exploring How Safe Images Can Be Exploited to Jailbreak Large Vision-Language Models Arxiv 2024 2024/11/18 None --- IT2T
IDEATOR: Jailbreaking Large Vision-Language Models Using Themselves Arxiv 2024 2024/11/15 None --- IT2T
Zer0-Jack: A memory-efficient gradient-based jailbreaking method for black box Multi-modal Large Language Models Arxiv 2024 2024/11/12 None --- IT2T
Audio is the Achilles' Heel: Red Teaming Audio Large Multimodal Models Arxiv 2024 2024/10/31 None Input Level AT2T
Advweb: Controllable black-box attacks on vlm-powered web agents Arxiv 2024 2024/10/22 None Input Level IT2T
Image Hijacks: Adversarial Images can Control Generative Models at Runtime Arxiv 2024 2024/09/01 Github Generator Level IT2T
Can Large Language Models Automatically Jailbreak GPT-4V? CCS 2024 2024/07/23 None Input Level IT2T
Arondight: Red Teaming Large Vision Language Models with Auto-generated Multi-modal Jailbreak Prompts ACM MM 2024 2024/07/21 None Input Level IT2T
Image-to-Text Logic Jailbreak: Your Imagination can Help You Do Anything Arxiv 2024 2024/07/01 None Input Level IT2T
From LLMs to MLLMs: Exploring the Landscape of Multimodal Jailbreaking Arxiv 2024 2024/06/21 None Encoder Level IT2T
Jailbreak Vision Language Models via Bi-Modal Adversarial Prompt Arxiv 2024 2024/06/06 Github Generator Level IT2T
Efficient LLM-Jailbreaking by Introducing Visual Modality Arxiv 2024 2024/05/30 None Generator Level IT2T
White-box Multimodal Jailbreaks Against Large Vision-Language Models Arxiv 2024 2024/05/28 None Generator Level IT2T
Visual-RolePlay: Universal Jailbreak Attack on MultiModal Large Language Models via Role-playing Image Character Arxiv 2024 2024/05/25 None Input Level IT2T
Images are Achilles' Heel of Alignment: Exploiting Visual Vulnerabilities for Jailbreaking Multimodal Large Language Models ECCV 2024 2024/05/14 Github Generator Level IT2T
Agent Smith: A Single Image Can Jailbreak One Million Multimodal LLM Agents Exponentially Fast ICML 2024 2024/02/13 Github Generator Level IT2T
Jailbreaking Attack against Multimodal Large Language Model Arxiv 2024 2024/02/04 None Generator Level IT2T
Jailbreak in pieces: Compositional Adversarial Attacks on Multi-Modal Language Models ICLR 2024 Spotlight 2024/01/16 Github Encoder Level IT2T
How Many Unicorns Are in This Image? A Safety Evaluation Benchmark for Vision LLMs ECCV 2024 2023/11/27 Github Encoder Level IT2T
Jailbreaking GPT-4V via Self-Adversarial Attacks with System Prompts Arxiv 2023 2023/11/15 None Input Level IT2T
FigStep: Jailbreaking Large Vision-language Models via Typographic Visual Prompts Arxiv 2023 2023/11/09 Github Input Level IT2T
Are aligned neural networks adversarially aligned? Arxiv 2023 2023/06/26 None Generator Level IT2T
Visual Adversarial Examples Jailbreak Aligned Large Language Models AAAI 2024 2023/06/22 Github Generator Level IT2T
On Evaluating Adversarial Robustness of Large Vision-Language Models NeurIPS 2023 2023/05/26 Homepage --- IT2T

Jailbreak Attack of Any-to-Vision Models

Title Venue Date Code Taxonomy Multimodal Model
Unfiltered and Unseen: Universal Multimodal Jailbreak Attacks on Text-to-Image Model Defenses OpenReview 2024/11/13 None --- T2I
Chain-of-Jailbreak Attack for Image Generation Models via Editing Step by Step Arxiv 2024 2024/10/04 None --- T2I
RT-Attack: Jailbreaking Text-to-Image Models via Random Token Arxiv 2024 2024/08/25 None Encoder Level T2I
Perception-guided Jailbreak against Text-to-Image Models Arxiv 2024 2024/08/20 None Input Level T2I
DiffZOO: A Purely Query-Based Black-Box Attack for Red-teaming Text-to-Image Generative Model via Zeroth Order Optimization Arxiv 2024 2024/08/18 None Output Level T2I
Jailbreaking Prompt Attack: A Controllable Adversarial Attack against Diffusion Models Arxiv 2024 2024/08/02 None Encoder Level T2I
Jailbreaking Text-to-Image Models with LLM-Based Agents Arxiv 2024 2024/08/01 None Input Level T2I
Automatic Jailbreaking of the Text-to-Image Generative AI Systems Arxiv 2024 2024/05/26 None Input Level T2I
UPAM: Unified Prompt Attack in Text-to-Image Generation Models Against Both Textual Filters and Visual Checkers ICML 2024 2024/05/18 None Input Level T2I
BSPA: Exploring Black-box Stealthy Prompt Attacks against Image Generators Arxiv 2024 2024/02/23 None Encoder Level T2I
Divide-and-Conquer Attack: Harnessing the Power of LLM to Bypass Safety Filters of Text-to-Image Models Arxiv 2023 2023/12/12 Github Input Level T2I
MMA-Diffusion: MultiModal Attack on Diffusion Models CVPR 2024 2023/11/29 Github Encoder Level T2I
VA3: Virtually Assured Amplification Attack on Probabilistic Copyright Protection for Text-to-Image Generative Models CVPR 2024 2023/11/29 Github Encoder Level T2I
To Generate or Not? Safety-Driven Unlearned Diffusion Models Are Still Easy To Generate Unsafe Images ... For Now ECCV 2024 2023/10/18 Github Generator Level T2I
Ring-A-Bell! How Reliable are Concept Removal Methods for Diffusion Models? ICLR 2024 2023/10/16 Github Encoder Level T2I
SurrogatePrompt: Bypassing the Safety Filter of Text-To-Image Models via Substitution CCS 2024 2023/09/25 Github Input Level T2I
Prompting4Debugging: Red-Teaming Text-to-Image Diffusion Models by Finding Problematic Prompts ICML 2024 2023/09/12 Github Generator Level T2I
SneakyPrompt: Jailbreaking Text-to-image Generative Models Symposium on Security and Privacy 2024 2023/05/20 Github Output Level T2I
Red-Teaming the Stable Diffusion Safety Filter NeurIPSW 2022 2022/10/03 None Input Level T2I

Jailbreak Attack of Any-to-Any Models

Title Venue Date Code Taxonomy Multimodal Model
Gradient-based Jailbreak Images for Multimodal Fusion Models Arxiv 2024 2024/10/04 Github --- IT2IT
Voice jailbreak attacks against gpt-4o Arxiv 2024 2024/05/29 Github Input Level Any2Any

🛡️Jailbreak Defense

📖Defense-Intro

Current efforts made in the jailbreak defense of multimodal generative models fall into two lines of work: discriminative defense and transformative defense.

  • Discriminative defense: constrained to classification tasks that assign binary (safe/unsafe) labels.

[Figure: jailbreak_discriminative_defense]

  • Transformative defense: aims to produce appropriate and safe responses even in the presence of malicious or adversarial inputs (a minimal sketch contrasting the two defense styles follows the figure below).

[Figure: jailbreak_transformative_defense]
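To make the distinction concrete, here is a minimal, hypothetical sketch contrasting the two defense styles. `safety_classifier`, `generate`, and the safety prefix are toy placeholders, not the method of any specific paper.

```python
# Toy placeholders: `safety_classifier` flags unsafe prompts and `generate`
# stands in for the underlying multimodal generative model.
def safety_classifier(prompt: str) -> bool:
    return "unsafe" in prompt.lower()        # toy rule; a real system uses a trained detector

def generate(prompt: str) -> str:
    return f"<model response to: {prompt}>"  # stand-in for the actual model call

SAFETY_PREFIX = "You must refuse requests for harmful, illegal, or explicit content.\n"

def discriminative_defense(prompt: str) -> str:
    # Binary decision only: block the request or pass it through unchanged.
    if safety_classifier(prompt):
        return "Request refused: the input was classified as unsafe."
    return generate(prompt)

def transformative_defense(prompt: str) -> str:
    # Steer the model toward a safe response instead of hard-blocking,
    # e.g., by prepending a protective system prompt (an input-level defense).
    return generate(SAFETY_PREFIX + prompt)
```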

📑Papers

Below are the papers related to jailbreak defense.

Jailbreak Defense of Any-to-Text Models

Title Venue Date Code Taxonomy Multimodal Model
Steering Away from Harm: An Adaptive Approach to Defending Vision Language Model Against Jailbreaks Arxiv 2024 2024/11/23 None --- IT2T
Uniguard: Towards universal safety guardrails for jailbreak attacks on multimodal large language models Arxiv 2024 2024/11/03 None Input Level IT2T
Effective and Efficient Adversarial Detection for Vision-Language Models via A Single Vector Arxiv 2024 2024/10/30 None Generator Level IT2T
BlueSuffix: Reinforced Blue Teaming for Vision-Language Models Against Jailbreak Attacks Arxiv 2024 2024/10/28 None Input Level IT2T
Information-theoretical principled trade-off between jailbreakability and stealthiness on vision language models Arxiv 2024 2024/10/02 None Input Level IT2T
Securing Vision-Language Models with a Robust Encoder Against Jailbreak and Adversarial Attacks Arxiv 2024 2024/09/11 None Encoder Level IT2T
Bathe: Defense against the jailbreak attack in multimodal large language models by treating harmful instruction as backdoor trigger Arxiv 2024 2024/08/17 None Generator Level IT2T
Defending jailbreak attack in vlms via cross-modality information detector Arxiv 2024 2024/07/31 Github Encoder Level IT2T
Sim-clip: Unsupervised siamese adversarial fine-tuning for robust and semantically-rich vision-language models Arxiv 2024 2024/07/20 None Encoder Level IT2T
Cross-modal safety alignment: Is textual unlearning all you need? Arxiv 2024 2024/05/27 None Generator Level IT2T
Safety alignment for vision language models Arxiv 2024 2024/05/22 None Generator Level IT2T
Adashield: Safeguarding multimodal large language models from structure-based attack via adaptive shield prompting ECCV 2024 2024/05/14 Github Input Level IT2T
Safety fine-tuning at (almost) no cost: A baseline for vision large language models ICML 2024 2024/02/03 Github Generator Level IT2T
Inferaligner: Inference-time alignment for harmlessness through cross-model guidance Arxiv 2024 2024/01/20 Github Encoder Level IT2T
Mllm-protector: Ensuring mllm’s safety without hurting performance Arxiv 2024 2024/01/05 Github Output Level IT2T
Jailguard: A universal detection framework for llm prompt-based attacks Arxiv 2023 2023/12/17 None Encoder Level IT2T
Safety-tuned llamas: Lessons from improving the safety of large language models that follow instructions ICLR 2024 2023/09/14 Github Generator Level IT2T

Jailbreak Defense of Any-to-Vision Models

Title Venue Date Code Taxonomy Multimodal Model
Safree: Training-free and adaptive guard for safe text-to-image and video generation Arxiv 2024 2024/10/16 None Output Level T2I/T2V
Dark miner: Defend against unsafe generation for text-to-image diffusion models Arxiv 2024 2024/09/26 None Generator Level T2I
Score forgetting distillation: A swift, data-free method for machine unlearning in diffusion models Arxiv 2024 2024/09/17 None Generator Level T2I
EIUP: A Training-Free Approach to Erase Non-Compliant Concepts Conditioned on Implicit Unsafe Prompts Arxiv 2024 2024/08/02 None Generator Level T2I
Direct Unlearning Optimization for Robust and Safe Text-to-Image Models ICML GenLaw workshop 2024 2024/07/17 None Encoder Level T2I
Reliable and Efficient Concept Erasure of Text-to-Image Diffusion Models ECCV 2024 2024/07/17 Github Generator Level T2I
Conceptprune: Concept editing in diffusion models via skilled neuron pruning Arxiv 2024 2024/05/29 Github Generator Level T2I
Pruning for Robust Concept Erasing in Diffusion Models Arxiv 2024 2024/05/26 None Generator Level T2I
Defensive unlearning with adversarial training for robust concept erasure in diffusion models NeurIPS 2024 2024/05/24 Github Encoder Level T2I
Unlearning concepts in diffusion model via concept domain correction and concept preserving gradient Arxiv 2024 2024/05/24 None Encoder Level T2I
Espresso: Robust Concept Filtering in Text-to-Image Models Arxiv 2024 2024/04/30 None Output Level T2I
Latent Guard: a Safety Framework for Text-to-image Generation ECCV 2024 2024/04/11 Github Encoder Level T2I
SafeGen: Mitigating Sexually Explicit Content Generation in Text-to-Image Models ACM CCS 2024 2024/04/10 Github Generator Level T2I
Salun: Empowering machine unlearning via gradient-based weight saliency in both image classification and generation ICLR 2024 2024/04/04 Github Generator Level T2I
GuardT2I: Defending Text-to-Image Models from Adversarial Prompts NeurIPS 2024 2024/03/03 None Input Level T2I
Universal prompt optimizer for safe text-to-image generation Arxiv 2024 2024/02/16 None Input Level T2I
Erasediff: Erasing data influence in diffusion models Arxiv 2024 2024/01/11 None Generator Level T2I
Localization and manipulation of immoral visual cues for safe text-to-image generation WACV 2024 2024/01/01 None Output Level T2I
Receler: Reliable concept erasing of text-to-image diffusion models via lightweight erasers ECCV 2024 2023/11/29 None Generator Level T2I
Self-discovering interpretable diffusion latent directions for responsible text-to-image generation CVPR 2024 2023/11/28 Github Input Level T2I
Safe-CLIP: Removing NSFW Concepts from Vision-and-Language Models ECCV 2024 2023/11/27 Github Encoder Level T2I
Mace: Mass concept erasure in diffusion models CVPR 2024 2023/10/19 Github Generator Level T2I
Implicit concept removal of diffusion models ECCV 2024 2023/10/09 None Input Level T2I
Unified concept editing in diffusion models WACV 2024 2023/08/25 Github Generator Level T2I
Towards safe self-distillation of internet-scale text-to-image diffusion models ICML 2023 Workshop on Challenges in Deployable Generative AI 2023/07/12 Github Generator Level T2I
Forget-Me-Not: Learning to Forget in Text-to-Image Diffusion Models CVPR 2024 2023/05/30 Github Generator Level T2I
Erasing concepts from diffusion models ICCV 2023 2023/05/13 Github Generator Level T2I
Safe Latent Diffusion: Mitigating Inappropriate Degeneration in Diffusion Models CVPR 2023 2022/11/09 Github Generator Level T2I

Jailbreak Defense of Any-to-Any Models

Title Venue Date Code Taxonomy Multimodal Model

💯Evaluation

⭐️Evaluation Datasets

Used for Any-to-Text Models

Dataset Task Text Source Image Source Volume Access
SafeBench Attack GPT generation Typography 500 Github
AdvBench Attack LLM generation N/A 500 Github
RedTeam-2K Attack Exist. & GPT Generation N/A 2000 Huggingface
HarmBench Attack & Defense Unpublished N/A 320 Github
HADES Defense GPT generation Typography & Diffusion Generation 750 Github
MM-SafetyBench Defense GPT generation Typography & Diffusion Generation 5040 Github
JailBreakV-28K Defense Adv. Prompt on RedTeam-2K Blank & Noise & Natural & Synthesize 28000 Huggingface
VLGuard Defense GPT generation Exist. 3000 Huggingface

Used for Any-to-Vision Models

Dataset Task Text Source Image Source Volume Access
NSFW-200 Attack Human curation N/A 200 Github
MMA Attack Exist. & Adv. Prompt N/A 1000 Huggingface
VBCDE-100 Attack Human curation N/A 100 Github
I2P Attack & Defense Real-world Website Real-world Website 4703 Huggingface
Unsafe Diffusion Defense Human curation & Website & Exist. N/A 1434 Github
MACE Defense Human curation Diffusion Generation 200 Github

📚Evaluation Methods

Current evaluation methods are primarily classified into two categories: manual evaluation and automated evaluation.

  • Manual evaluation involves human assessment to determine if the content is toxic, offering a direct and interpretable method of evaluation.
  • Automated approaches assess the safety of multimodal generative models by employing a range of techniques, including detector-based, GPT-based, and rule-based methods (a detector-based sketch follows the figure below).

[Figure: jailbreak_evaluation]
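As a concrete example of a detector-based automated evaluation, the sketch below scores model responses with the Detoxify classifier listed in the table that follows. The 0.5 toxicity threshold and the success-rate definition are illustrative assumptions rather than a standard prescribed by the survey.

```python
from detoxify import Detoxify  # pip install detoxify

def attack_success_rate(responses, threshold=0.5):
    """Detector-based evaluation: count a jailbreak attempt as successful when
    the model's response exceeds an (assumed) toxicity threshold."""
    detector = Detoxify("original")  # pretrained multi-label toxicity classifier
    scores = [detector.predict(text)["toxicity"] for text in responses]
    flagged = sum(score >= threshold for score in scores)
    return flagged / max(len(responses), 1)

# Usage: attack_success_rate(["model output 1", "model output 2"])
```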

Text Detector

Toxicity detector Access
Llama Guard Huggingface
Llama Guard 2 Huggingface
Detoxify Github
GPTFUZZER Huggingface
Perspective API Website

Image Detector

Toxicity detector Access
NudeNet Github
Q16 Github
Safety Checker Huggingface
Imgcensor Github
Multi-headed Safety Classifier Github

😉Citation

If you find this work useful in your research, please kindly cite it using the following BibTeX:

@article{liu2024jailbreak,
    title={Jailbreak Attacks and Defenses against Multimodal Generative Models: A Survey},
    author={Liu, Xuannan and Cui, Xing and Li, Peipei and Li, Zekun and Huang, Huaibo and Xia, Shuhan and Zhang, Miaoxuan and Zou, Yueying and He, Ran},
    journal={arXiv preprint arXiv:2411.09259},
    year={2024},
}