This repo lists relevant papers summarized in our survey paper: A Systematic Survey of Prompt Engineering on Vision-Language Foundation Models.

Awesome Prompting on Vision-Language Models

# 🤓 What is Prompting on Vision-Language Models?

Prompt engineering is a technique that involves augmenting a large pre-trained model with task-specific hints, known as prompts, to adapt the model to new tasks. This repo aims to provide a comprehensive survey of cutting-edge research in prompt engineering on three types of vision-language models (VLMs): multimodal-to-text generation models (e.g., Flamingo), image-text matching models (e.g., CLIP), and text-to-image generation models (e.g., Stable Diffusion) (Fig. 1).

Fig. 1: This work focuses on three main types of vision-language models.

## Reference

This repo lists relevant papers summarized in our survey:

A Systematic Survey of Prompt Engineering on Vision-Language Foundation Models. Jindong Gu, Zhen Han, Shuo Chen, Ahmad Beirami, Bailan He, Gengyuan Zhang, Ruotong Liao, Yao Qin, Volker Tresp, Philip Torr. Preprint 2023. [pdf]

If you find our paper and repo helpful to your research, please cite the following paper:

@article{gu2023survey,
  title={A Systematic Survey of Prompt Engineering on Vision-Language Foundation Models},
  author={Gu, Jindong and Han, Zhen and Chen, Shuo and Beirami, Ahmad and He, Bailan and Zhang, Gengyuan and Liao, Ruotong and Qin, Yao and Tresp, Volker and Torr, Philip},
  journal={arXiv preprint arXiv:2307.12980},
  year={2023}
}

# 🖇️ Awesome Papers

## Prompting Models in Multimodal-to-Text Generation (e.g. on Flamingo)

There are two main types of fusion module approaches, distinguished by how the visual and textual modalities are integrated: encoder-decoder as a multi-modal fusion module and decoder-only as a multi-modal fusion module. Prompting methods can be divided into two main categories (Fig. 2) based on the readability of the templates: hard prompts and soft prompts. Hard prompts encompass four subcategories: task instruction, in-context learning, retrieval-based prompting, and chain-of-thought prompting. Soft prompts are classified into two strategies, prompt tuning and prefix token tuning, based on whether the new tokens are simply appended to the input or added internally to the model's architecture. This study primarily concentrates on prompt methods that avoid altering the base model.

Fig. 2: Classification of prompting methods.
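To make the taxonomy concrete, the sketch below builds one prompt for each hard-prompt subcategory as a plain string and represents a soft prompt as extra vectors placed in front of the input embeddings. All function names, templates, and the word-overlap retriever are hypothetical illustrations chosen for brevity; they are not taken from any of the listed papers, where retrievers and soft prompts are learned rather than hand-written.

```python
from typing import List, Tuple

QA = Tuple[str, str]  # (question, answer) demonstration


def task_instruction(question: str) -> str:
    """Hard prompt: a readable instruction placed before the query."""
    return f"Answer the question about the image.\nQuestion: {question}\nAnswer:"


def in_context(examples: List[QA], question: str) -> str:
    """Hard prompt: a few solved demonstrations followed by the new query."""
    demos = "".join(f"Question: {q}\nAnswer: {a}\n\n" for q, a in examples)
    return f"{demos}Question: {question}\nAnswer:"


def retrieval_based(pool: List[QA], question: str, k: int = 1) -> str:
    """Hard prompt: select the k most similar pool examples as demonstrations.

    Similarity here is naive word overlap (an assumption for this sketch).
    """
    query_words = set(question.lower().split())

    def overlap(example: QA) -> int:
        return len(query_words & set(example[0].lower().split()))

    nearest = sorted(pool, key=overlap, reverse=True)[:k]
    return in_context(nearest, question)


def chain_of_thought(examples: List[Tuple[str, str, str]], question: str) -> str:
    """Hard prompt: demonstrations that spell out intermediate reasoning."""
    demos = "".join(
        f"Question: {q}\nReasoning: {r}\nAnswer: {a}\n\n" for q, r, a in examples
    )
    return f"{demos}Question: {question}\nReasoning:"


def prepend_soft_prompt(
    soft_prompt: List[List[float]], token_embeddings: List[List[float]]
) -> List[List[float]]:
    """Soft prompt: non-readable vectors concatenated in front of the input
    embeddings; the base model's weights are left untouched."""
    return soft_prompt + token_embeddings


if __name__ == "__main__":
    pool = [("What color is the car?", "red"), ("How many dogs are there?", "two")]
    print(retrieval_based(pool, "What color is the bus?", k=1))
```

In a real system the soft prompt vectors would be trained by gradient descent while the base model stays frozen; here they are fixed lists that only show where the prompt enters the input.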

| Title | Venue | Year | Code if available | Comment |
| --- | --- | --- | --- | --- |
| Unifying Vision-and-Language Tasks via Text Generation | ICML | 2021 | Github | Encoder-decoder fusion; Text prefixes as prompt |
| SimVLM: Simple Visual Language Model Pretraining with Weak Supervision | ICLR | 2022 | Github | Encoder-decoder fusion; Text prefixes as prompt |
| OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework | ICML | 2022 | Github | Encoder-decoder fusion; Text prefixes as prompt |
| PaLI: A Jointly-Scaled Multilingual Language-Image Model | ICLR | 2023 | --- | Encoder-decoder fusion; Instruction prompt |
| Multimodal Few-Shot Learning with Frozen Language Models | NeurIPS | 2021 | Page | Decoder-only fusion; Image conditional prefix tuning |
| Flamingo: a Visual Language Model for Few-Shot Learning | NeurIPS | 2022 | Github | Decoder-only fusion; Text prompts |
| MAGMA -- Multimodal Augmentation of Generative Models through Adapter-based Finetuning | EMNLP | 2022 | Github | Decoder-only fusion; Image conditional prefix tuning |
| BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models | ICML | 2023 | Github | Decoder-only fusion; Image conditional prefix tuning |
| Language Models are Unsupervised Multitask Learners | OpenAI Blog | 2019 | Github | Task instruction prompt |
| The Turking Test: Can Language Models Understand Instructions? | arXiv | 2020 | --- | Task instruction prompt |
| Language Models are Few-Shot Learners | NeurIPS | 2020 | --- | In-context learning |
| Learning To Retrieve Prompts for In-Context Learning | NAACL-HLT | 2022 | Github | Retrieval-based prompting |
| Unified Demonstration Retriever for In-Context Learning | ACL | 2023 | Github | Retrieval-based prompting |
| Compositional Exemplars for In-context Learning | ICML | 2023 | Github | Retrieval-based prompting |
| Chain-of-Thought Prompting Elicits Reasoning in Large Language Models | NeurIPS | 2022 | --- | Chain-of-thought prompting |
| Automatic Chain of Thought Prompting in Large Language Models | ICLR | 2023 | Github | Chain-of-thought prompting |
| The Power of Scale for Parameter-Efficient Prompt Tuning | EMNLP | 2021 | --- | Prompt tuning |
| Learning How to Ask: Querying LMs with Mixtures of Soft Prompts | NAACL-HLT | 2021 | Github | Prompt tuning |
| Prefix-Tuning: Optimizing Continuous Prompts for Generation | ACL | 2021 | Github | Prefix tuning |
| Prompt Tuning for Generative Multimodal Pretrained Models | ACL | 2023 | Github | Prompt tuning on OFA |
| Language Is Not All You Need: Aligning Perception with Language Models | arXiv | 2023 | Github | Textual instruction prompts |
| Benchmarking Robustness of Adaptation Methods on Pre-trained Vision-Language Models | arXiv | 2023 | Page | Robustness of prompt tuning on VLMs |
| Towards Robust Prompts on Vision-Language Models | arXiv | 2023 | --- | Robustness of prompt tuning on VLMs |

## Prompting Models in Image-Text Matching (e.g. on CLIP)

Depending on the target of prompting, existing methods can be classified into three categories: prompting the text encoder, prompting the visual encoder, or jointly prompting both branches, as shown in Fig. 2. These approaches aim to enhance the flexibility and task-specific performance of VLMs.

Fig. 2: Classification of prompting methods on Image-Text Matching VLMs.
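As a minimal sketch of the simplest case, prompting the text encoder with a hard prompt, the code below scores one image embedding against one text prompt per class and picks the closest class. The template `a photo of a {}.` follows the zero-shot setup described for CLIP; `toy_encode_text` and its two-dimensional embeddings are made-up stand-ins for a real text encoder, so only the control flow should be read as meaningful.

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def zero_shot_classify(
    image_embedding: Sequence[float],
    class_names: List[str],
    encode_text: Callable[[str], Sequence[float]],
    template: str = "a photo of a {}.",
) -> Tuple[str, Dict[str, float]]:
    """Fill the hard text prompt with each class name, embed it, and return
    the class whose prompt embedding is most similar to the image embedding."""
    scores = {
        name: cosine(image_embedding, encode_text(template.format(name)))
        for name in class_names
    }
    best = max(scores, key=scores.get)
    return best, scores


# Hypothetical stand-in for a text encoder: a lookup of hand-picked vectors.
TOY_TEXT_EMBEDDINGS = {
    "a photo of a cat.": [0.9, 0.1],
    "a photo of a dog.": [0.1, 0.9],
}


def toy_encode_text(prompt: str) -> Sequence[float]:
    return TOY_TEXT_EMBEDDINGS[prompt]


if __name__ == "__main__":
    # The image embedding is also invented; it lies close to the "cat" prompt.
    label, scores = zero_shot_classify([0.8, 0.2], ["cat", "dog"], toy_encode_text)
    print(label)  # prints: cat
```

Soft text prompts keep this scoring loop but replace the hand-written template words with learnable context vectors, while visual prompts instead modify the image side before it is embedded.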

| Title | Venue | Year | Code if available | Comment |
| --- | --- | --- | --- | --- |
| Learning Transferable Visual Models From Natural Language Supervision | ICML | 2021 | Github | Hard text prompts; Prompt for Image classification |
| Delving into the Openness of CLIP | ACL | 2023 | Github | Hard text prompts for understanding |
| Test-Time Prompt Tuning for Zero-Shot Generalization in Vision-Language Models | NeurIPS | 2022 | Github | Soft text prompts |
| Learning to Prompt for Vision-Language Models | IJCV | 2022 | Github | Soft text prompts |
| Prompting Visual-Language Models for Efficient Video Understanding | ECCV | 2022 | Github | Soft text prompts |
| Multitask Vision-Language Prompt Tuning | arXiv | 2022 | Github | Soft text prompts |
| Conditional Prompt Learning for Vision-Language Models | CVPR | 2022 | Github | Soft text prompts |
| Visual Prompt Tuning | ECCV | 2022 | Github | Visual patch-wise prompts |
| Exploring Visual Prompts for Adapting Large-Scale Models | arXiv | 2022 | Github | Visual patch-wise prompts |
| Multitask Vision-Language Prompt Tuning | arXiv | 2022 | Github | Visual patch-wise prompts |
| Unleashing the Power of Visual Prompting At the Pixel Level | arXiv | 2022 | Github | Visual patch-wise prompts |
| Diversity-Aware Meta Visual Prompting | CVPR | 2023 | Github | Visual patch-wise prompts |
| CPT: Colorful Prompt Tuning for Pre-trained Vision-Language Models | arXiv | 2022 | Github | Visual annotation prompts |
| What does CLIP know about a red circle? Visual prompt engineering for VLMs | arXiv | 2023 | --- | Visual annotation prompts |
| Visual Prompting via Image Inpainting | NeurIPS | 2022 | Github | Visual annotation prompts |
| Unified Vision and Language Prompt Learning | arXiv | 2023 | Github | Coupled unified prompting |
| Multitask Vision-Language Prompt Tuning | arXiv | 2022 | Github | Decoupled unified prompting |
| MaPLe: Multi-modal Prompt Learning | CVPR | 2023 | Github | Decoupled unified prompting |
| Understanding Zero-shot Adversarial Robustness for Large-Scale Models | ICLR | 2023 | Code | Adversarial robustness of prompt |
| Visual Prompting for Adversarial Robustness | ICASSP | 2023 | Github | Adversarial robustness of prompt |
| Align before Fuse: Vision and Language Representation Learning with Momentum Distillation | NeurIPS | 2021 | Github | Image-Text Matching Model |
| Unsupervised Prompt Learning for Vision-Language Models | arXiv | 2022 | Github | Unsupervised learnable prompts |
| Test-Time Prompt Tuning for Zero-Shot Generalization in Vision-Language Models | NeurIPS | 2022 | Github | Learnable prompt |
| Prompt Pre-Training with Over Twenty-Thousand Classes for Open-Vocabulary Visual Recognition | arXiv | 2023 | Github | Prompt Pre-Training |

### Applications & Responsible AI

| Title | Venue | Year | Code if available | Comment |
| --- | --- | --- | --- | --- |
| LMPT: Prompt Tuning with Class-Specific Embedding Loss for Long-tailed Multi-Label Visual Recognition | arXiv | 2023 | Github | Prompts for long-tailed multi-label image classification |
| Test-Time Prompt Tuning for Zero-Shot Generalization in Vision-Language Models | NeurIPS | 2022 | Github | Learnable prompt; Prompts for image classification |
| LPT: Long-tailed Prompt Tuning for Image Classification | ICLR | 2023 | Github | Prompts for long-tailed image classification |
| Texts as Images in Prompt Tuning for Multi-Label Image Recognition | CVPR | 2023 | Github | Prompts for multi-label image classification and detection |
| DualCoOp: Fast Adaptation to Multi-Label Recognition with Limited Annotations | NeurIPS | 2022 | Github | Prompts for multi-label image classification and recognition |
| Visual Prompt Tuning for Few-Shot Text Classification | ICCL | 2022 | --- | Visual prompts for text classification |
| Open-vocabulary Object Detection via Vision and Language Knowledge Distillation | ICLR | 2021 | Github | Prompts for object detection |
| Learning to Prompt for Open-Vocabulary Object Detection with Vision-Language Model | CVPR | 2022 | Github | Prompts for object detection |
| PromptDet: Towards Open-vocabulary Detection using Uncurated Images | ECCV | 2022 | Github | Prompts for object detection |
| Optimizing Continuous Prompts for Visual Relationship Detection by Affix-Tuning | IEEE Access | 2022 | --- | Soft prompts for visual relation detection |
| Towards Open-vocabulary Scene Graph Generation with Prompt-based Finetuning | ECCV | 2022 | --- | Soft prompts for visual relation detection |
| Compositional Prompt Tuning with Motion Cues for Open-vocabulary Video Relation Detection | ICLR | 2023 | Github | Relation Prompts for video open-vocabulary relation detection |
| DenseCLIP: Language-Guided Dense Prediction with Context-Aware Prompting | CVPR | 2022 | Github | Class-conditioned text prompts for semantic segmentation |
| Segment Anything | ICCV | 2023 | Github | Promptable queries for semantic segmentation |
| Domain Adaptation via Prompt Learning | arXiv | 2022 | Github | Domain-specific textual prompts for domain adaptation |
| Visual Prompt Tuning for Test-time Domain Adaptation | arXiv | 2022 | --- | Prompts for domain adaptation |
| Learning to Prompt for Continual Learning | CVPR | 2022 | Github | Prompts for continual learning |
| DualPrompt: Complementary Prompting for Rehearsal-free Continual Learning | ECCV | 2022 | Github | Prompts for continual learning |
| Prompt Vision Transformer for Domain Generalization | arXiv | 2022 | Github | Prompts for domain generalization |
| Understanding Zero-Shot Adversarial Robustness for Large-Scale Models | arXiv | 2022 | Github | Visual prompt tuning under adversarial attack |
| Visual Prompting for Adversarial Robustness | ICASSP | 2023 | Github | Visual prompting to improve the adversarial robustness |
| Exploring the Universal Vulnerability of Prompt-based Learning Paradigm | NAACL | 2022 | Github | Visual prompting vulnerability |
| Poisoning and Backdooring Contrastive Learning | ICLR | 2022 | --- | Backdoor and poisoning attacks on CLIP |
| BadEncoder: Backdoor Attacks to Pre-trained Encoders in Self-Supervised Learning | IEEE | 2022 | Github | Backdoor attack on CLIP |
| CleanCLIP: Mitigating Data Poisoning Attacks in Multimodal Contrastive Learning | ICLR Workshop | 2023 | --- | Defense backdoor attacks on CLIP |
| Debiasing Vision-Language Models via Biased Prompts | arXiv | 2023 | Github | Prompts to alleviate bias |

## Prompting Models in Text-to-Image Generation (e.g. on Stable Diffusion)

| Title | Venue | Year | Code if available | Comment |
| --- | --- | --- | --- | --- |
| Diffusion Models Beat GANs on Image Synthesis | NeurIPS | 2021 | Github | Diffusion models on image generation |
| Denoising Diffusion Probabilistic Models | NeurIPS | 2020 | Github | Diffusion models on image generation |
| SuS-X: Training-Free Name-Only Transfer of Vision-Language Models | ICCV | 2023 | Github | Diffusion models on image generation |
| Investigating Prompt Engineering in Diffusion Models | NeurIPS Workshop | 2022 | --- | Semantic prompt design |
| DiffuMask: Synthesizing Images with Pixel-level Annotations for Semantic Segmentation Using Diffusion Models | arXiv | 2023 | Github | Diversify generation with prompt; Prompts for synthetic data generation |
| Is synthetic data from generative models ready for image recognition? | ICLR | 2023 | Github | Diversify generation with prompt |
| An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion | ICLR | 2023 | Github | Complex control of synthesis results via prompts |
| DreamBooth: Fine Tuning Text-to-Image Diffusion Models for Subject-Driven Generation | CVPR | 2023 | Github | Complex control of synthesis results via prompts |
| Multi-Concept Customization of Text-to-Image Diffusion | CVPR | 2023 | Github | Complex control of synthesis results via prompts |
| Prompt-to-Prompt Image Editing with Cross Attention Control | arXiv | 2022 | --- | Complex control of synthesis results via prompts |
| Training-Free Structured Diffusion Guidance for Compositional Text-to-Image Synthesis | ICLR | 2023 | Github | Controllable text-to-image generation |
| Diffusion Self-Guidance for Controllable Image Generation | arXiv | 2023 | Page | Controllable text-to-image generation |
| Imagic: Text-Based Real Image Editing with Diffusion Models | CVPR | 2023 | Github | Controllable text-to-image generation |
| Adding Conditional Control to Text-to-Image Diffusion Models | arXiv | 2023 | Github | Controllable text-to-image generation |
| Prompt-to-Prompt Image Editing with Cross Attention Control | arXiv | 2022 | Github | Complex control of synthesis results via prompts |
| ImaginaryNet: Learning Object Detectors without Real Images and Annotations | ICLR | 2023 | Github | Prompts for synthetic data generation |
| Is synthetic data from generative models ready for image recognition? | ICLR | 2023 | Github | Prompts for synthetic data generation |
| Make-A-Video: Text-to-Video Generation without Text-Video Data | ICLR | 2023 | Page | Prompts for text-to-video generation |
| Imagen Video: High Definition Video Generation with Diffusion Models | arXiv | 2022 | Page | Prompts for text-to-video generation |
| FateZero: Fusing Attentions for Zero-shot Text-based Video Editing | arXiv | 2023 | Github | Prompts for text-to-video generation |
| Tune-A-Video: One-Shot Tuning of Image Diffusion Models for Text-to-Video Generation | ICCV | 2023 | Github | Prompts for text-to-video generation |
| DiffRF: Rendering-Guided 3D Radiance Field Diffusion | CVPR | 2023 | Page | Prompts for text-to-3D generation |
| DreamFusion: Text-to-3D using 2D Diffusion | arXiv | 2022 | Page | Prompts for text-to-3D generation |
| Dream3D: Zero-Shot Text-to-3D Synthesis Using 3D Shape Prior and Text-to-Image Diffusion Models | CVPR | 2023 | Page | Prompts for text-to-3D generation |
| MotionDiffuse: Text-Driven Human Motion Generation with Diffusion Model | arXiv | 2022 | Page | Prompts for text-to-motion generation |
| FLAME: Free-form Language-based Motion Synthesis & Editing | AAAI | 2023 | Github | Prompts for text-to-motion generation |
| MDM: Human Motion Diffusion Model | ICLR | 2023 | Github | Prompts for text-to-motion generation |
| Zero-shot Generation of Coherent Storybook from Plain Text Story using Diffusion Models | arXiv | 2023 | --- | Prompts for complex tasks |
| Multimodal Procedural Planning via Dual Text-Image Prompting | arXiv | 2023 | Github | Prompts for complex tasks |
| Prompt Stealing Attacks Against Text-to-Image Generation Models | arXiv | 2023 | --- | Prompts for responsible AI |
| Membership Inference Attacks Against Text-to-image Generation Models | arXiv | 2022 | --- | Membership attacks against text-to-image models |
| Are Diffusion Models Vulnerable to Membership Inference Attacks? | ICML | 2023 | Github | Membership attacks against text-to-image models |
| A Reproducible Extraction of Training Images from Diffusion Models | arXiv | 2023 | Github | Membership attacks against text-to-image models |
| Fair Diffusion: Instructing Text-to-Image Generation Models on Fairness | arXiv | 2023 | Github | Prompts on text-to-image models considering fairness |
| Social Biases through the Text-to-Image Generation Lens | arXiv | 2023 | --- | Prompts on text-to-image models considering biases |
| T2IAT: Measuring Valence and Stereotypical Biases in Text-to-Image Generation | ACL | 2023 | --- | Prompts on text-to-image models considering biases |
| Stable Bias: Analyzing Societal Representations in Diffusion Models | arXiv | 2023 | --- | Prompts on text-to-image models considering biases |
| A Pilot Study of Query-Free Adversarial Attack Against Stable Diffusion | CVPR | 2023 | --- | Adversarial robustness of text-to-image models |
| Diffusion Models for Imperceptible and Transferable Adversarial Attack | arXiv | 2023 | Github | Adversarial robustness of text-to-image models |
| Diffusion Models for Adversarial Purification | ICML | 2022 | Github | Adversarial robustness of text-to-image models |
| Rickrolling the Artist: Injecting Backdoors into Text Encoders for Text-to-Image Synthesis | arXiv | 2022 | --- | Backdoor attack on text-to-image models |
| Text-to-Image Diffusion Models can be Easily Backdoored through Multimodal Data Poisoning | arXiv | 2023 | --- | Backdoor attack on text-to-image models |
| Zero-Day Backdoor Attack against Text-to-Image Diffusion Models via Personalization | arXiv | 2023 | --- | Backdoor attack on text-to-image models |

# 📬 Contact

Please contact us (jindong.gu@outlook.com, chenshuo.cs@outlook.com) if

  • you would like to add your papers to this repo,
  • you find any mistakes in this repo,
  • you have any suggestions for this repo.