Awesome-Multimodal-Jailbreak

A Survey on Jailbreak Attacks and Defenses against Multimodal Generative Models

😈🛡️Awesome-Jailbreak-against-Multimodal-Generative-Models

🔥🔥🔥 Jailbreak Attacks and Defenses against Multimodal Generative Models: A Survey


We've curated a collection of the latest 😋, most comprehensive 😎, and most valuable 🤩 resources on jailbreak attacks and defenses against multimodal generative models.
But we don't stop there; our repository is constantly updated to ensure you have the most current information at your fingertips.

[Figure: survey overview]

🤗Introduction

This survey presents a comprehensive review of existing jailbreak attacks and defenses against multimodal generative models.
Following the generalized lifecycle of a multimodal jailbreak, we systematically explore attacks and the corresponding defense strategies across four levels: input, encoder, generator, and output.

🧑‍💻 Four Levels of the Multimodal Jailbreak Lifecycle

  • Input Level: Attackers and defenders operate solely on the input data. Attackers modify inputs to execute attacks, while defenders incorporate protective cues to enhance detection.
  • Encoder Level: With access to the encoder, attackers optimize adversarial inputs to inject malicious information into the encoding process, while defenders work to prevent harmful information from being encoded within the latent space.
  • Generator Level: With full access to the generative models, attackers leverage inference information, such as activations and gradients, and fine-tune models to increase adversarial effectiveness, while defenders use these techniques to strengthen model robustness.
  • Output Level: With the output from the generative model, attackers can iteratively refine adversarial inputs, while defenders can apply post-processing techniques to enhance detection.

Based on this analysis, we present a detailed taxonomy of attack methods, defense mechanisms, and evaluation frameworks specific to multimodal generative models.
We cover a wide range of input-output configurations, including modalities such as Any-to-Text, Any-to-Vision, and Any-to-Any within generative systems.

[Figure: survey taxonomy]

🚀Table of Contents

🔥Multimodal Generative Models

Below are tables of model short names and representative generative models used in jailbreak research. For input/output modalities, I: Image, T: Text, V: Video, A: Audio.

📑Any-to-Text Models (LLM Backbone)

Short Name Modality Representative Model
IT2T I + T → T LLaVA, MiniGPT4, InstructBLIP
VT2T V + T → T Video-LLaVA, Video-LLaMA
AT2T A + T → T Audio Flamingo, AudioPaLM

📖Any-to-Vision (Diffusion Backbone)

Short Name Modality Representative Model
T2I T → I Stable Diffusion, Midjourney, DALL·E
IT2I I + T → I DreamBooth, InstructP2P
T2V T → V Open-Sora, Stable Video Diffusion
IT2V I + T → V VideoPoet, CogVideoX

📰Any-to-Any (Unified Backbone)

Short Name Modality Representative Model
IT2IT I + T → I + T NExT-GPT, Chameleon
TIV2TIV T + I + V → T + I + V EMU3
Any2Any Any → Any GPT-4o, Gemini Ultra

😈Jailbreak Attack

📖Attack-Intro

We categorize attack methods into black-box, gray-box, and white-box attacks. In the black-box setting, where the model is inaccessible to the attacker, attacks are limited to surface-level interactions, focusing solely on the model's input and/or output. For gray-box and white-box attacks, we consider model-level attacks, covering both the encoder and the generator.

  • Input-level attack: Attackers are compelled to develop more sophisticated input templates spanning prompt engineering, image engineering, and role-play techniques.
  • Output-level attack: Attackers focus on querying outputs across multiple input variants. Driven by a specific adversarial goal, they employ estimation-based and search-based techniques to iteratively refine these input variants (a minimal sketch follows the figure below).

[Figure: jailbreak_attack_black_box]
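To make the output-level loop concrete, here is a minimal, hypothetical sketch of a search-based attack. The `query_model`, `score_harmfulness`, and `candidate_edits` arguments are illustrative placeholders for a black-box model API, an attacker-side scoring oracle, and a set of prompt mutations; none of them comes from a specific paper.

```python
import random

def search_based_attack(seed_prompt, query_model, score_harmfulness,
                        candidate_edits, n_iters=50, target_score=0.9):
    """Generic search-based output-level refinement: mutate the prompt, keep the
    variant whose output scores highest, and stop once the (assumed) goal is met."""
    best_prompt = seed_prompt
    best_score = score_harmfulness(query_model(seed_prompt))
    for _ in range(n_iters):
        mutate = random.choice(candidate_edits)        # e.g., synonym swap, token insertion
        candidate = mutate(best_prompt)
        score = score_harmfulness(query_model(candidate))
        if score > best_score:                         # keep only improving variants
            best_prompt, best_score = candidate, score
        if best_score >= target_score:                 # adversarial goal (approximately) reached
            break
    return best_prompt
```

Estimation-based attacks follow the same query loop but replace the random mutations with gradient estimates computed from repeated queries.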

  • Encoder-level attack: Attackers are restricted to accessing only the encoders to provoke harmful responses. In this case, they typically seek to maximize cosine similarity within the latent space, ensuring the adversarial input retains semantics similar to the target malicious content while still being classified as safe (see the sketch after the figure below).
  • Generator-level attack: Attackers have unrestricted access to the generative model's architecture and checkpoints, enabling thorough investigation and manipulation and thus more sophisticated attacks.

[Figure: jailbreak_attack_white_and_gray_box]
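As an illustration of the cosine-similarity objective described above, below is a minimal PyTorch sketch of an encoder-level attack. It assumes a differentiable CLIP-style image encoder `encode_image` and a precomputed target text embedding; the step count, learning rate, and L_inf budget are arbitrary placeholder values.

```python
import torch
import torch.nn.functional as F

def encoder_level_attack(image, target_text_emb, encode_image,
                         steps=200, lr=1e-2, eps=8 / 255):
    """Optimize a bounded perturbation so the image embedding aligns with a
    target (malicious) text embedding in the shared latent space, while the
    pixel-level change stays within an L_inf budget."""
    delta = torch.zeros_like(image, requires_grad=True)
    opt = torch.optim.Adam([delta], lr=lr)
    for _ in range(steps):
        adv_emb = encode_image((image + delta).clamp(0, 1))
        # Maximize cosine similarity to the target embedding (minimize its negative).
        loss = -F.cosine_similarity(adv_emb, target_text_emb, dim=-1).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
        # Project the perturbation back into the allowed L_inf ball.
        delta.data.clamp_(-eps, eps)
    return (image + delta).clamp(0, 1).detach()
```

The same objective can target a harmful image embedding instead of a text embedding; the defining trait of this level is that the optimization never touches the downstream generator.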

📑Papers

Below are the papers related to jailbreak attacks.

Jailbreak Attack of Any-to-Text Models

Title Venue Date Code Taxonomy Multimodal Model
Safe + Safe = Unsafe? Exploring How Safe Images Can Be Exploited to Jailbreak Large Vision-Language Models Arxiv 2024 2024/11/18 None --- IT2T
IDEATOR: Jailbreaking Large Vision-Language Models Using Themselves Arxiv 2024 2024/11/15 None --- IT2T
Zer0-Jack: A memory-efficient gradient-based jailbreaking method for black box Multi-modal Large Language Models Arxiv 2024 2024/11/12 None --- IT2T
Audio is the Achilles' Heel: Red Teaming Audio Large Multimodal Models Arxiv 2024 2024/10/31 None Input Level AT2T
Advweb: Controllable black-box attacks on vlm-powered web agents Arxiv 2024 2024/10/22 None Input Level IT2T
Image Hijacks: Adversarial Images can Control Generative Models at Runtime Arxiv 2024 2024/09/01 Github Generator Level IT2T
Can Large Language Models Automatically Jailbreak GPT-4V? CCS 2024 2024/07/23 None Input Level IT2T
Arondight: Red Teaming Large Vision Language Models with Auto-generated Multi-modal Jailbreak Prompts ACM MM 2024 2024/07/21 None Input Level IT2T
Image-to-Text Logic Jailbreak: Your Imagination can Help You Do Anything Arxiv 2024 2024/07/01 None Input Level IT2T
From LLMs to MLLMs: Exploring the Landscape of Multimodal Jailbreaking Arxiv 2024 2024/06/21 None Encoder Level IT2T
Jailbreak Vision Language Models via Bi-Modal Adversarial Prompt Arxiv 2024 2024/06/06 Github Generator Level IT2T
Efficient LLM-Jailbreaking by Introducing Visual Modality Arxiv 2024 2024/05/30 None Generator Level IT2T
White-box Multimodal Jailbreaks Against Large Vision-Language Models Arxiv 2024 2024/05/28 None Generator Level IT2T
Visual-RolePlay: Universal Jailbreak Attack on MultiModal Large Language Models via Role-playing Image Character Arxiv 2024 2024/05/25 None Input Level IT2T
Images are Achilles' Heel of Alignment: Exploiting Visual Vulnerabilities for Jailbreaking Multimodal Large Language Models ECCV 2024 2024/05/14 Github Generator Level IT2T
Agent Smith: A Single Image Can Jailbreak One Million Multimodal LLM Agents Exponentially Fast ICML 2024 2024/02/13 Github Generator Level IT2T
Jailbreaking Attack against Multimodal Large Language Model Arxiv 2024 2024/02/04 None Generator Level IT2T
Jailbreak in pieces: Compositional Adversarial Attacks on Multi-Modal Language Models ICLR 2024 Spotlight 2024/01/16 Github Encoder Level IT2T
How Many Unicorns Are in This Image? A Safety Evaluation Benchmark for Vision LLMs ECCV 2024 2023/11/27 Github Encoder Level IT2T
Jailbreaking GPT-4V via Self-Adversarial Attacks with System Prompts Arxiv 2023 2023/11/15 None Input Level IT2T
FigStep: Jailbreaking Large Vision-language Models via Typographic Visual Prompts Arxiv 2023 2023/11/09 Github Input Level IT2T
Are aligned neural networks adversarially aligned? Arxiv 2023 2023/06/26 None Generator Level IT2T
Visual Adversarial Examples Jailbreak Aligned Large Language Models AAAI 2024 2023/06/22 Github Generator Level IT2T
On Evaluating Adversarial Robustness of Large Vision-Language Models NeurIPS 2023 2023/05/26 Homepage --- IT2T

Jailbreak Attack of Any-to-Vision Models

Title Venue Date Code Taxonomy Multimodal Model
Unfiltered and Unseen: Universal Multimodal Jailbreak Attacks on Text-to-Image Model Defenses OpenReview 2024/11/13 None --- T2I
Chain-of-Jailbreak Attack for Image Generation Models via Editing Step by Step Arxiv 2024 2024/10/04 None --- T2I
RT-Attack: Jailbreaking Text-to-Image Models via Random Token Arxiv 2024 2024/08/25 None Encoder Level T2I
Perception-guided Jailbreak against Text-to-Image Models Arxiv 2024 2024/08/20 None Input Level T2I
DiffZOO: A Purely Query-Based Black-Box Attack for Red-teaming Text-to-Image Generative Model via Zeroth Order Optimization Arxiv 2024 2024/08/18 None Output Level T2I
Jailbreaking Prompt Attack: A Controllable Adversarial Attack against Diffusion Models Arxiv 2024 2024/08/02 None Encoder Level T2I
Jailbreaking Text-to-Image Models with LLM-Based Agents Arxiv 2024 2024/08/01 None Input Level T2I
Automatic Jailbreaking of the Text-to-Image Generative AI Systems Arxiv 2024 2024/05/26 None Input Level T2I
UPAM: Unified Prompt Attack in Text-to-Image Generation Models Against Both Textual Filters and Visual Checkers ICML 2024 2024/05/18 None Input Level T2I
BSPA: Exploring Black-box Stealthy Prompt Attacks against Image Generators Arxiv 2024 2024/02/23 None Encoder Level T2I
Divide-and-Conquer Attack: Harnessing the Power of LLM to Bypass Safety Filters of Text-to-Image Models Arxiv 2023 2023/12/12 Github Input Level T2I
MMA-Diffusion: MultiModal Attack on Diffusion Models CVPR 2024 2023/11/29 Github Encoder Level T2I
VA3: Virtually Assured Amplification Attack on Probabilistic Copyright Protection for Text-to-Image Generative Models CVPR 2024 2023/11/29 Github Encoder Level T2I
To Generate or Not? Safety-Driven Unlearned Diffusion Models Are Still Easy To Generate Unsafe Images ... For Now ECCV 2024 2023/10/18 Github Generator Level T2I
Ring-A-Bell! How Reliable are Concept Removal Methods for Diffusion Models? ICLR 2024 2023/10/16 Github Encoder Level T2I
SurrogatePrompt: Bypassing the Safety Filter of Text-To-Image Models via Substitution CCS 2024 2023/09/25 Github Input Level T2I
Prompting4Debugging: Red-Teaming Text-to-Image Diffusion Models by Finding Problematic Prompts ICML 2024 2023/09/12 Github Generator Level T2I
SneakyPrompt: Jailbreaking Text-to-image Generative Models Symposium on Security and Privacy 2024 2023/05/20 Github Output Level T2I
Red-Teaming the Stable Diffusion Safety Filter NeurIPSW 2022 2022/10/03 None Input Level T2I

Jailbreak Attack of Any-to-Any Models

Title Venue Date Code Taxonomy Multimodal Model
Gradient-based Jailbreak Images for Multimodal Fusion Models Arxiv 2024 2024/10/04 Github --- IT2IT
Voice jailbreak attacks against gpt-4o Arxiv 2024 2024/05/29 Github Input Level Any2Any

🛡️Jailbreak Defense

📖Defense-Intro

Current efforts made in the jailbreak defense of multimodal generative models fall into two lines of work: discriminative defense and transformative defense.

  • Discriminative defense: constrained to classification tasks that assign binary (safe/unsafe) labels.

[Figure: jailbreak_discriminative_defense]

  • Transformative defense: aims to produce appropriate and safe responses even in the presence of malicious or adversarial inputs (a minimal sketch contrasting the two defense styles follows the figure below).

[Figure: jailbreak_transformative_defense]
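To make the distinction concrete, here is a minimal, hypothetical sketch contrasting the two defense styles. `safety_classifier`, `generate`, and the safety prefix are toy placeholders, not the method of any specific paper.

```python
# Toy placeholders: `safety_classifier` flags unsafe prompts and `generate`
# stands in for the underlying multimodal generative model.
def safety_classifier(prompt: str) -> bool:
    return "unsafe" in prompt.lower()        # toy rule; a real system uses a trained detector

def generate(prompt: str) -> str:
    return f"<model response to: {prompt}>"  # stand-in for the actual model call

SAFETY_PREFIX = "You must refuse requests for harmful, illegal, or explicit content.\n"

def discriminative_defense(prompt: str) -> str:
    # Binary decision only: block the request or pass it through unchanged.
    if safety_classifier(prompt):
        return "Request refused: the input was classified as unsafe."
    return generate(prompt)

def transformative_defense(prompt: str) -> str:
    # Steer the model toward a safe response instead of hard-blocking,
    # e.g., by prepending a protective system prompt (an input-level defense).
    return generate(SAFETY_PREFIX + prompt)
```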

📑Papers

Below are the papers related to jailbreak defense.

Jailbreak Defense of Any-to-Text Models

Title Venue Date Code Taxonomy Multimodal Model
Steering Away from Harm: An Adaptive Approach to Defending Vision Language Model Against Jailbreaks Arxiv 2024 2024/11/23 None --- IT2T
Uniguard: Towards universal safety guardrails for jailbreak attacks on multimodal large language models Arxiv 2024 2024/11/03 None Input Level IT2T
Effective and Efficient Adversarial Detection for Vision-Language Models via A Single Vector Arxiv 2024 2024/10/30 None Generator Level IT2T
BlueSuffix: Reinforced Blue Teaming for Vision-Language Models Against Jailbreak Attacks Arxiv 2024 2024/10/28 None Input Level IT2T
Information-theoretical principled trade-off between jailbreakability and stealthiness on vision language models Arxiv 2024 2024/10/02 None Input Level IT2T
Securing Vision-Language Models with a Robust Encoder Against Jailbreak and Adversarial Attacks Arxiv 2024 2024/09/11 None Encoder Level IT2T
Bathe: Defense against the jailbreak attack in multimodal large language models by treating harmful instruction as backdoor trigger Arxiv 2024 2024/08/17 None Generator Level IT2T
Defending jailbreak attack in vlms via cross-modality information detector Arxiv 2024 2024/07/31 Github Encoder Level IT2T
Sim-clip: Unsupervised siamese adversarial fine-tuning for robust and semantically-rich vision-language models Arxiv 2024 2024/07/20 None Encoder Level IT2T
Cross-modal safety alignment: Is textual unlearning all you need? Arxiv 2024 2024/05/27 None Generator Level IT2T
Safety alignment for vision language models Arxiv 2024 2024/05/22 None Generator Level IT2T
Adashield: Safeguarding multimodal large language models from structure-based attack via adaptive shield prompting ECCV 2024 2024/05/14 Github Input Level IT2T
Safety fine-tuning at (almost) no cost: A baseline for vision large language models ICML 2024 2024/02/03 Github Generator Level IT2T
Inferaligner: Inference-time alignment for harmlessness through cross-model guidance Arxiv 2024 2024/01/20 Github Encoder Level IT2T
Mllm-protector: Ensuring mllm’s safety without hurting performance Arxiv 2024 2024/01/05 Github Output Level IT2T
Jailguard: A universal detection framework for llm prompt-based attacks Arxiv 2023 2023/12/17 None Encoder Level IT2T
Safety-tuned llamas: Lessons from improving the safety of large language models that follow instructions ICLR 2024 2023/09/14 Github Generator Level IT2T

Jailbreak Defense of Any-to-Vision Models

Title Venue Date Code Taxonomy Multimodal Model
Safree: Training-free and adaptive guard for safe text-to-image and video generation Arxiv 2024 2024/10/16 None Output Level T2I/T2V
Dark miner: Defend against unsafe generation for text-to-image diffusion models Arxiv 2024 2024/09/26 None Generator Level T2I
Score forgetting distillation: A swift, data-free method for machine unlearning in diffusion models Arxiv 2024 2024/09/17 None Generator Level T2I
EIUP: A Training-Free Approach to Erase Non-Compliant Concepts Conditioned on Implicit Unsafe Prompts Arxiv 2024 2024/08/02 None Generator Level T2I
Direct Unlearning Optimization for Robust and Safe Text-to-Image Models ICML GenLaw workshop 2024 2024/07/17 None Encoder Level T2I
Reliable and Efficient Concept Erasure of Text-to-Image Diffusion Models ECCV 2024 2024/07/17 Github Generator Level T2I
Conceptprune: Concept editing in diffusion models via skilled neuron pruning Arxiv 2024 2024/05/29 Github Generator Level T2I
Pruning for Robust Concept Erasing in Diffusion Models Arxiv 2024 2024/05/26 None Generator Level T2I
Defensive unlearning with adversarial training for robust concept erasure in diffusion models NeurIPS 2024 2024/05/24 Github Encoder Level T2I
Unlearning concepts in diffusion model via concept domain correction and concept preserving gradient Arxiv 2024 2024/05/24 None Encoder Level T2I
Espresso: Robust Concept Filtering in Text-to-Image Models Arxiv 2024 2024/04/30 None Output Level T2I
Latent Guard: a Safety Framework for Text-to-image Generation ECCV 2024 2024/04/11 Github Encoder Level T2I
SafeGen: Mitigating Sexually Explicit Content Generation in Text-to-Image Models ACM CCS 2024 2024/04/10 Github Generator Level T2I
Salun: Empowering machine unlearning via gradient-based weight saliency in both image classification and generation ICLR 2024 2024/04/04 Github Generator Level T2I
GuardT2I: Defending Text-to-Image Models from Adversarial Prompts NeurIPS 2024 2024/03/03 None Input Level T2I
Universal prompt optimizer for safe text-to-image generation Arxiv 2024 2024/02/16 None Input Level T2I
Erasediff: Erasing data influence in diffusion models Arxiv 2024 2024/01/11 None Generator Level T2I
Localization and manipulation of immoral visual cues for safe text-to-image generation WACV 2024 2024/01/01 None Output Level T2I
Receler: Reliable concept erasing of text-to-image diffusion models via lightweight erasers ECCV 2024 2023/11/29 None Generator Level T2I
Self-discovering interpretable diffusion latent directions for responsible text-to-image generation CVPR 2024 2023/11/28 Github Input Level T2I
Safe-CLIP: Removing NSFW Concepts from Vision-and-Language Models ECCV 2024 2023/11/27 Github Encoder Level T2I
Mace: Mass concept erasure in diffusion models CVPR 2024 2023/10/19 Github Generator Level T2I
Implicit concept removal of diffusion models ECCV 2024 2023/10/09 None Input Level T2I
Unified concept editing in diffusion models WACV 2024 2023/08/25 Github Generator Level T2I
Towards safe self-distillation of internet-scale text-to-image diffusion models ICML 2023 Workshop on Challenges in Deployable Generative AI 2023/07/12 Github Generator Level T2I
Forget-Me-Not: Learning to Forget in Text-to-Image Diffusion Models CVPR 2024 2023/05/30 Github Generator Level T2I
Erasing concepts from diffusion models ICCV 2023 2023/05/13 Github Generator Level T2I
Safe Latent Diffusion: Mitigating Inappropriate Degeneration in Diffusion Models CVPR 2023 2022/11/09 Github Generator Level T2I

Jailbreak Defense of Any-to-Any Models

Title Venue Date Code Taxonomy Multimodal Model

💯Evaluation

⭐️Evaluation Datasets

Used for Any-to-Text Models

Dataset Task Text Source Image Source Volume Access
SafeBench Attack GPT generation Typography 500 Github
AdvBench Attack LLM generation N/A 500 Github
RedTeam-2K Attack Exist. & GPT Generation N/A 2000 Huggingface
HarmBench Attack & Defense Unpublished N/A 320 Github
HADES Defense GPT generation Typography & Diffusion Generation 750 Github
MM-SafetyBench Defense GPT generation Typography & Diffusion Generation 5040 Github
JailBreakV-28K Defense Adv. Prompt on RedTeam-2K Blank & Noise & Natural & Synthesize 28000 Huggingface
VLGuard Defense GPT generation Exist. 3000 Huggingface

Used for Any-to-Vision Models

Dataset Task Text Source Image Source Volume Access
NSFW-200 Attack Human curation N/A 200 Github
MMA Attack Exist. & Adv. Prompt N/A 1000 Huggingface
VBCDE-100 Attack Human curation N/A 100 Github
I2P Attack & Defense Real-world Website Real-world Website 4703 Huggingface
Unsafe Diffusion Defense Human curation & Website & Exist. N/A 1434 Github
MACE Defense Human curation Diffusion Generation 200 Github

📚Evaluation Methods

Current evaluation methods are primarily classified into two categories: manual evaluation and automated evaluation.

  • Manual evaluation involves human assessment to determine if the content is toxic, offering a direct and interpretable method of evaluation.
  • Automated approaches assess the safety of multimodal generative models by employing a range of techniques, including detector-based, GPT-based, and rule-based methods (a detector-based sketch follows the figure below).

[Figure: jailbreak_evaluation]
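As a concrete example of a detector-based automated evaluation, the sketch below scores model responses with the Detoxify classifier listed in the table that follows. The 0.5 toxicity threshold and the success-rate definition are illustrative assumptions rather than a standard prescribed by the survey.

```python
from detoxify import Detoxify  # pip install detoxify

def attack_success_rate(responses, threshold=0.5):
    """Detector-based evaluation: count a jailbreak attempt as successful when
    the model's response exceeds an (assumed) toxicity threshold."""
    detector = Detoxify("original")  # pretrained multi-label toxicity classifier
    scores = [detector.predict(text)["toxicity"] for text in responses]
    flagged = sum(score >= threshold for score in scores)
    return flagged / max(len(responses), 1)

# Usage: attack_success_rate(["model output 1", "model output 2"])
```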

Text Detector

Toxicity detector Access
Llama Guard Huggingface
Llama Guard 2 Huggingface
Detoxify Github
GPTFUZZER Huggingface
Perspective API Website

Image Detector

Toxicity detector Access
NudeNet Github
Q16 Github
Safety Checker Huggingface
Imgcensor Github
Multi-headed Safety Classifier Github

😉Citation

If you find this work useful in your research, please kindly cite it using the following BibTeX:

@article{liu2024jailbreak,
    title={Jailbreak Attacks and Defenses against Multimodal Generative Models: A Survey},
    author={Liu, Xuannan and Cui, Xing and Li, Peipei and Li, Zekun and Huang, Huaibo and Xia, Shuhan and Zhang, Miaoxuan and Zou, Yueying and He, Ran},
    journal={arXiv preprint arXiv:2411.09259},
    year={2024},
}