Awesome Prompting on Vision-Language Models

# 🤓 What is Prompting on Vision-Language Models?

Prompt engineering is a technique that involves augmenting a large pre-trained model with task-specific hints, known as prompts, to adapt the model to new tasks. This repo aims to provide a comprehensive survey of cutting-edge research in prompt engineering on three types of vision-language models (VLMs): multimodal-to-text generation models (e.g., Flamingo), image-text matching models (e.g., CLIP), and text-to-image generation models (e.g., Stable Diffusion) (Fig. 1).

Fig. 1: This work focuses on three main types of vision-language models.

## Reference

This repo lists relevant papers summarized in our survey:

A Systematic Survey of Prompt Engineering on Vision-Language Foundation Models. Jindong Gu, Zhen Han, Shuo Chen, Ahmad Beirami, Bailan He, Gengyuan Zhang, Ruotong Liao, Yao Qin, Volker Tresp, Philip Torr. Preprint 2023. [pdf]

If you find our paper and repo helpful to your research, please cite the following paper:

@article{gu2023survey,
  title={A Systematic Survey of Prompt Engineering on Vision-Language Foundation Models},
  author={Gu, Jindong and Han, Zhen and Chen, Shuo and Beirami, Ahmad and He, Bailan and Zhang, Gengyuan and Liao, Ruotong and Qin, Yao and Tresp, Volker and Torr, Philip},
  journal={arXiv preprint arXiv:2307.12980},
  year={2023}
}

# 🖇️ Awesome Papers

Prompting Model in Multimodal-to-Text Generation
Prompting Model in Image-Text Matching
Prompting Model in Text-to-Image Generation

## Prompting Models in Multimodal-to-Text Generation (e.g. on Flamingo)

Models in this family fuse the visual and textual modalities in one of two ways: with an encoder-decoder as the multimodal fusion module, or with a decoder-only fusion module. Prompting methods fall into two main categories (Fig. 2) based on whether the template is human-readable: hard prompts and soft prompts. Hard prompts encompass four subcategories: task instruction, in-context learning, retrieval-based prompting, and chain-of-thought prompting. Soft prompts are classified into two strategies, prompt tuning and prefix token tuning, based on whether they internally add new tokens to the model's architecture or simply append them to the input. This study primarily concentrates on prompting methods that avoid altering the base model.

Fig. 2: Classification of prompting methods.
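
The taxonomy above can be illustrated with a minimal sketch. It is not taken from any of the listed papers: the function names, the prompt wording, the word-overlap retriever, and the "Let's think step by step" trigger phrase are all assumptions chosen for illustration. The three hard-prompt builders produce readable text, while the soft-prompt helper shows only the "append to the input" variant (prompt tuning), where learnable vectors are concatenated in front of frozen token embeddings.

```python
from typing import List, Sequence, Tuple

Demo = Tuple[str, str]  # (question, answer) pair used as a demonstration


def task_instruction(question: str) -> str:
    """Hard prompt: a human-readable instruction wrapped around the query."""
    return f"Answer the question about the image.\nQuestion: {question}\nAnswer:"


def in_context(demos: Sequence[Demo], question: str) -> str:
    """Hard prompt: solved demonstrations are placed before the new query."""
    shots = [f"Question: {q}\nAnswer: {a}" for q, a in demos]
    return "\n\n".join(shots + [f"Question: {question}\nAnswer:"])


def retrieve_demos(pool: Sequence[Demo], question: str, k: int = 1) -> List[Demo]:
    """Retrieval-based prompting: pick the k demonstrations closest to the query.

    Word overlap is a toy similarity measure (an assumption for this sketch);
    real retrievers typically use learned embeddings.
    """
    query_words = set(question.lower().split())

    def overlap(demo: Demo) -> int:
        return len(query_words & set(demo[0].lower().split()))

    return sorted(pool, key=overlap, reverse=True)[:k]


def chain_of_thought(question: str) -> str:
    """Hard prompt: a trigger phrase asks the model to reason before answering."""
    return f"Question: {question}\nAnswer: Let's think step by step."


def prepend_soft_prompt(
    soft_prompt: Sequence[Sequence[float]],
    token_embeddings: Sequence[Sequence[float]],
) -> List[List[float]]:
    """Soft prompt (prompt-tuning style): learnable vectors go in front of the
    frozen input embeddings; the base model itself is left unchanged."""
    dim = len(token_embeddings[0])
    if any(len(v) != dim for v in list(soft_prompt) + list(token_embeddings)):
        raise ValueError("soft prompt and token embeddings must share one dimension")
    return [list(v) for v in soft_prompt] + [list(v) for v in token_embeddings]


if __name__ == "__main__":
    pool = [("What color is the car?", "red"), ("How many dogs are there?", "two")]
    demos = retrieve_demos(pool, "What color is the bus?", k=1)
    print(in_context(demos, "What color is the bus?"))
```

In the soft-prompt case only the prepended vectors would be trained; the sketch stops at building the input sequence and leaves the optimization loop out.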

Title	Venue	Year	Code if available	Comment
Unifying Vision-and-Language Tasks via Text Generation	ICML	2021	Github	Encoder-decoder fusion; Text prefixes as prompt
SimVLM: Simple Visual Language Model Pretraining with Weak Supervision	ICLR	2022	Github	Encoder-decoder fusion; Text prefixes as prompt
OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework	ICML	2022	Github	Encoder-decoder fusion; Text prefixes as prompt
PaLI: A Jointly-Scaled Multilingual Language-Image Model	ICLR	2023	---	Encoder-decoder fusion; Instruction prompt
Multimodal Few-Shot Learning with Frozen Language Models	NeurIPS	2021	Page	Decoder-only fusion; Image conditional prefix tuning
Flamingo: a Visual Language Model for Few-Shot Learning	NeurIPS	2022	Github	Decoder-only fusion; Text prompts
MAGMA -- Multimodal Augmentation of Generative Models through Adapter-based Finetuning	EMNLP	2022	Github	Decoder-only fusion; Image conditional prefix tuning
BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models	ICML	2023	Github	Decoder-only fusion; Image conditional prefix tuning
Language Models are Unsupervised Multitask Learners	OpenAI Blog	2019	Github	Task instruction prompt
The Turking Test: Can Language Models Understand Instructions?	arXiv	2020	---	Task instruction prompt
Language Models are Few-Shot Learners	NeurIPS	2020	---	In-context learning
Learning To Retrieve Prompts for In-Context Learning	NAACL-HLT	2022	Github	Retrieval-based prompting
Unified Demonstration Retriever for In-Context Learning	ACL	2023	Github	Retrieval-based prompting
Compositional Exemplars for In-context Learning	ICML	2023	Github	Retrieval-based prompting
Chain-of-Thought Prompting Elicits Reasoning in Large Language Models	NeurIPS	2022	---	Chain-of-thought prompting
Automatic Chain of Thought Prompting in Large Language Models	ICLR	2023	Github	Chain-of-thought prompting
The Power of Scale for Parameter-Efficient Prompt Tuning	EMNLP	2021	---	Prompt tuning
Learning How to Ask: Querying LMs with Mixtures of Soft Prompts	NAACL-HLT	2021	Github	Prompt tuning
Prefix-Tuning: Optimizing Continuous Prompts for Generation	ACL	2021	Github	Prefix tuning
Prompt Tuning for Generative Multimodal Pretrained Models	ACL	2023	Github	Prompt tuning on OFA
Language Is Not All You Need: Aligning Perception with Language Models	arXiv	2023	Github	Textual instruction prompts
Benchmarking Robustness of Adaptation Methods on Pre-trained Vision-Language Models	arXiv	2023	Page	Robustness of prompt tuning on VLMs
Towards Robust Prompts on Vision-Language Models	arXiv	2023	---	Robustness of prompt tuning on VLMs

## Prompting Model in Image-Text Matching (e.g. on CLIP)

Depending on the target of prompting, existing methods can be classified into three categories: prompting the text encoder, prompting the visual encoder, or jointly prompting both branches, as shown in Fig. 2. These approaches aim to enhance the flexibility and task-specific performance of VLMs.

Fig. 2: Classification of prompting methods on Image-Text Matching VLMs.
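
As a rough illustration of the three prompting targets, the sketch below runs a CLIP-style zero-shot classification on toy data. Everything in it is an assumption made for the example: `encode_text` and `encode_image` are hypothetical stand-ins for frozen encoders (a word-vector average and an identity map), the three-dimensional vectors are invented, and the template "a photo of a {}" is just one common hard text prompt. The text side is prompted through the template, the visual side through an additive pixel-level perturbation, and applying both corresponds to joint prompting.

```python
import math
from typing import Dict, List, Optional, Sequence

# Invented toy word vectors; a real text encoder is a learned network.
TOY_WORD_VECTORS: Dict[str, List[float]] = {
    "cat": [1.0, 0.0, 0.0],
    "dog": [0.0, 1.0, 0.0],
    "photo": [0.0, 0.0, 1.0],
}


def encode_text(text: str) -> List[float]:
    """Hypothetical frozen text encoder: average the known word vectors."""
    vectors = [TOY_WORD_VECTORS[w] for w in text.lower().split() if w in TOY_WORD_VECTORS]
    if not vectors:
        raise ValueError(f"no known words in {text!r}")
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def encode_image(pixels: Sequence[float]) -> List[float]:
    """Hypothetical frozen image encoder: identity map on a 3-value 'image'."""
    return list(pixels)


def apply_visual_prompt(pixels: Sequence[float], delta: Sequence[float]) -> List[float]:
    """Visual prompting: add a (learnable) perturbation to the input pixels."""
    return [p + d for p, d in zip(pixels, delta)]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def classify(
    pixels: Sequence[float],
    class_names: Sequence[str],
    template: str = "a photo of a {}",
    visual_prompt: Optional[Sequence[float]] = None,
) -> str:
    """Return the class whose prompted text embedding best matches the image."""
    if visual_prompt is not None:
        pixels = apply_visual_prompt(pixels, visual_prompt)
    image_features = encode_image(pixels)
    scores = {
        name: cosine(image_features, encode_text(template.format(name)))
        for name in class_names
    }
    return max(scores, key=scores.get)


if __name__ == "__main__":
    image = [0.9, 0.1, 0.3]
    print(classify(image, ["cat", "dog"]))
    print(classify(image, ["cat", "dog"], visual_prompt=[-0.9, 0.9, 0.0]))
```

Both encoders stay fixed throughout; only the template and the perturbation change, which is the sense in which these methods adapt a model without altering its weights.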

Title	Venue	Year	Code if available	Comment
Learning Transferable Visual Models From Natural Language Supervision	ICML	2021	Github	Hard text prompts; Prompt for Image classification
Delving into the Openness of CLIP	ACL	2023	Github	Hard text prompts for understanding
Test-Time Prompt Tuning for Zero-Shot Generalization in Vision-Language Models	NeurIPS	2022	Github	Soft text prompts
Learning to Prompt for Vision-Language Models	IJCV	2022	Github	Soft text prompts
Prompting Visual-Language Models for Efficient Video Understanding	ECCV	2022	Github	Soft text prompts
Multitask Vision-Language Prompt Tuning	arXiv	2022	Github	Soft text prompts
Conditional Prompt Learning for Vision-Language Models	CVPR	2022	Github	Soft text prompts
Visual Prompt Tuning	ECCV	2022	Github	Visual patch-wise prompts
Exploring Visual Prompts for Adapting Large-Scale Models	arXiv	2022	Github	Visual patch-wise prompts
Multitask Vision-Language Prompt Tuning	arXiv	2022	Github	Visual patch-wise prompts
Unleashing the Power of Visual Prompting At the Pixel Level	arXiv	2022	Github	Visual patch-wise prompts
Diversity-Aware Meta Visual Prompting	CVPR	2023	Github	Visual patch-wise prompts
CPT: Colorful Prompt Tuning for Pre-trained Vision-Language Models	arXiv	2022	Github	Visual annotation prompts
What does CLIP know about a red circle? Visual prompt engineering for VLMs	arXiv	2023	---	Visual annotation prompts
Visual Prompting via Image Inpainting	NeurIPS	2022	Github	Visual annotation prompts
Unified Vision and Language Prompt Learning	arXiv	2023	Github	Coupled unified prompting
Multitask Vision-Language Prompt Tuning	arXiv	2022	Github	Decoupled unified prompting
MaPLe: Multi-modal Prompt Learning	CVPR	2023	Github	Decoupled unified prompting
Understanding Zero-shot Adversarial Robustness for Large-Scale Models	ICLR	2023	Code	Adversarial robustness of prompt
Visual Prompting for Adversarial Robustness	ICASSP	2023	Github	Adversarial robustness of prompt
Align before Fuse: Vision and Language Representation Learning with Momentum Distillation	NeurIPS	2021	Github	Image-Text Matching Model
Unsupervised Prompt Learning for Vision-Language Models	arXiv	2022	Github	Unsupervised learnable prompts
Test-Time Prompt Tuning for Zero-Shot Generalization in Vision-Language Models	NeurIPS	2022	Github	Learnable prompt
Prompt Pre-Training with Over Twenty-Thousand Classes for Open-Vocabulary Visual Recognition	arXiv	2023	Github	Prompt Pre-Training

### Applications & Responsible AI

Title	Venue	Year	Code if available	Comment
LMPT: Prompt Tuning with Class-Specific Embedding Loss for Long-tailed Multi-Label Visual Recognition	arXiv	2023	Github	Prompts for long-tailed multi-label image classification
Test-Time Prompt Tuning for Zero-Shot Generalization in Vision-Language Models	NeurIPS	2022	Github	Learnable prompt; Prompts for image classification
LPT: Long-tailed Prompt Tuning for Image Classification	ICLR	2023	Github	Prompts for long-tailed image classification
Texts as Images in Prompt Tuning for Multi-Label Image Recognition	CVPR	2023	Github	Prompts for multi-label image classification and detection
DualCoOp: Fast Adaptation to Multi-Label Recognition with Limited Annotations	NeurIPS	2022	Github	Prompts for multi-label image classification and recognition
Visual Prompt Tuning for Few-Shot Text Classification	ICCL	2022	---	Visual prompts for text classification
Open-vocabulary Object Detection via Vision and Language Knowledge Distillation	ICLR	2021	Github	Prompts for object detection
Learning to Prompt for Open-Vocabulary Object Detection with Vision-Language Model	CVPR	2022	Github	Prompts for object detection
PromptDet: Towards Open-vocabulary Detection using Uncurated Images	ECCV	2022	Github	Prompts for object detection
Optimizing Continuous Prompts for Visual Relationship Detection by Affix-Tuning	IEEE Access	2022	---	Soft prompts for visual relation detection
Towards Open-vocabulary Scene Graph Generation with Prompt-based Finetuning	ECCV	2022	---	Soft prompts for visual relation detection
Compositional Prompt Tuning with Motion Cues for Open-vocabulary Video Relation Detection	ICLR	2023	Github	Relation Prompts for video open-vocabulary relation detection
DenseCLIP: Language-Guided Dense Prediction with Context-Aware Prompting	CVPR	2022	Github	Class-conditioned text prompts for semantic segmentation
Segment Anything	ICCV	2023	Github	Promptable queries for semantic segmentation
Domain Adaptation via Prompt Learning	arXiv	2022	Github	Domain-specific textual prompts for domain adaptation
Visual Prompt Tuning for Test-time Domain Adaptation	arXiv	2022	---	Prompts for domain adaptation
Learning to Prompt for Continual Learning	CVPR	2022	Github	Prompts for continual learning
DualPrompt: Complementary Prompting for Rehearsal-free Continual Learning	ECCV	2022	Github	Prompts for continual learning
Prompt Vision Transformer for Domain Generalization	arXiv	2022	Github	Prompts for domain generalization
Understanding Zero-Shot Adversarial Robustness for Large-Scale Models	arXiv	2022	Github	Visual prompt tuning under adversarial attack
Visual Prompting for Adversarial Robustness	ICASSP	2023	Github	Visual prompting to improve the adversarial robustness
Exploring the Universal Vulnerability of Prompt-based Learning Paradigm	NAACL	2022	Github	Visual prompting vulnerability
Poisoning and Backdooring Contrastive Learning	ICLR	2022	---	Backdoor and poisoning attacks on CLIP
BadEncoder: Backdoor Attacks to Pre-trained Encoders in Self-Supervised Learning	IEEE S&P	2022	Github	Backdoor attack on CLIP
CleanCLIP: Mitigating Data Poisoning Attacks in Multimodal Contrastive Learning	ICLR Workshop	2023	---	Defense against backdoor attacks on CLIP
Debiasing Vision-Language Models via Biased Prompts	arXiv	2023	Github	Prompts to alleviate bias

## Prompting Model in Text-to-Image Generation (e.g. on Stable Diffusion)

Title	Venue	Year	Code if available	Comment
Diffusion Models Beat GANs on Image Synthesis	NeurIPS	2021	Github	Diffusion models on image generation
Denoising Diffusion Probabilistic Models	NeurIPS	2020	Github	Diffusion models on image generation
SuS-X: Training-Free Name-Only Transfer of Vision-Language Models	ICCV	2023	Github	Diffusion models on image generation
Investigating Prompt Engineering in Diffusion Models	NeurIPS Workshop	2022	---	Semantic prompt design
DiffuMask: Synthesizing Images with Pixel-level Annotations for Semantic Segmentation Using Diffusion Models	arXiv	2023	Github	Diversify generation with prompt; Prompts for synthetic data generation
Is synthetic data from generative models ready for image recognition?	ICLR	2023	Github	Diversify generation with prompt
An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion	ICLR	2023	Github	Complex control of synthesis results via prompts
DreamBooth: Fine Tuning Text-to-Image Diffusion Models for Subject-Driven Generation	CVPR	2023	Github	Complex control of synthesis results via prompts
Multi-Concept Customization of Text-to-Image Diffusion	CVPR	2023	Github	Complex control of synthesis results via prompts
Training-Free Structured Diffusion Guidance for Compositional Text-to-Image Synthesis	ICLR	2023	Github	Controllable text-to-image generation
Diffusion Self-Guidance for Controllable Image Generation	arXiv	2023	Page	Controllable text-to-image generation
Imagic: Text-Based Real Image Editing with Diffusion Models	CVPR	2023	Github	Controllable text-to-image generation
Adding Conditional Control to Text-to-Image Diffusion Models	arXiv	2023	Github	Controllable text-to-image generation
Prompt-to-Prompt Image Editing with Cross Attention Control	arXiv	2022	Github	Complex control of synthesis results via prompts
ImaginaryNet: Learning Object Detectors without Real Images and Annotations	ICLR	2023	Github	Prompts for synthetic data generation
Is synthetic data from generative models ready for image recognition?	ICLR	2023	Github	Prompts for synthetic data generation
Make-A-Video: Text-to-Video Generation without Text-Video Data	ICLR	2023	Page	Prompts for text-to-video generation
Imagen Video: High Definition Video Generation with Diffusion Models	arXiv	2022	Page	Prompts for text-to-video generation
FateZero: Fusing Attentions for Zero-shot Text-based Video Editing	arXiv	2023	Github	Prompts for text-to-video generation
Tune-A-Video: One-Shot Tuning of Image Diffusion Models for Text-to-Video Generation	ICCV	2023	Github	Prompts for text-to-video generation
DiffRF: Rendering-Guided 3D Radiance Field Diffusion	CVPR	2023	Page	Prompts for text-to-3D generation
DreamFusion: Text-to-3D using 2D Diffusion	arXiv	2022	Page	Prompts for text-to-3D generation
Dream3D: Zero-Shot Text-to-3D Synthesis Using 3D Shape Prior and Text-to-Image Diffusion Models	CVPR	2023	Page	Prompts for text-to-3D generation
MotionDiffuse: Text-Driven Human Motion Generation with Diffusion Model	arXiv	2022	Page	Prompts for text-to-motion generation
FLAME: Free-form Language-based Motion Synthesis & Editing	AAAI	2023	Github	Prompts for text-to-motion generation
MDM: Human Motion Diffusion Model	ICLR	2023	Github	Prompts for text-to-motion generation
Zero-shot Generation of Coherent Storybook from Plain Text Story using Diffusion Models	arXiv	2023	---	Prompts for complex tasks
Multimodal Procedural Planning via Dual Text-Image Prompting	arXiv	2023	Github	Prompts for complex tasks
Prompt Stealing Attacks Against Text-to-Image Generation Models	arXiv	2023	---	Prompts for responsible AI
Membership Inference Attacks Against Text-to-image Generation Models	arXiv	2022	---	Membership attacks against text-to-image models
Are Diffusion Models Vulnerable to Membership Inference Attacks?	ICML	2023	Github	Membership attacks against text-to-image models
A Reproducible Extraction of Training Images from Diffusion Models	arXiv	2023	Github	Membership attacks against text-to-image models
Fair Diffusion: Instructing Text-to-Image Generation Models on Fairness	arXiv	2023	Github	Prompts on text-to-image models considering fairness
Social Biases through the Text-to-Image Generation Lens	arXiv	2023	---	Prompts on text-to-image models considering biases
T2IAT: Measuring Valence and Stereotypical Biases in Text-to-Image Generation	ACL	2023	---	Prompts on text-to-image models considering biases
Stable Bias: Analyzing Societal Representations in Diffusion Models	arXiv	2023	---	Prompts on text-to-image models considering biases
A Pilot Study of Query-Free Adversarial Attack Against Stable Diffusion	CVPR	2023	---	Adversarial robustness of text-to-image models
Diffusion Models for Imperceptible and Transferable Adversarial Attack	arXiv	2023	Github	Adversarial robustness of text-to-image models
Diffusion Models for Adversarial Purification	ICML	2022	Github	Adversarial robustness of text-to-image models
Rickrolling the Artist: Injecting Backdoors into Text Encoders for Text-to-Image Synthesis	arXiv	2022	---	Backdoor attack on text-to-image models
Text-to-Image Diffusion Models can be Easily Backdoored through Multimodal Data Poisoning	arXiv	2023	---	Backdoor attack on text-to-image models
Zero-Day Backdoor Attack against Text-to-Image Diffusion Models via Personalization	arXiv	2023	---	Backdoor attack on text-to-image models

# 📬 Contact

Please contact us (jindong.gu@outlook.com, chenshuo.cs@outlook.com) if

you would like to add your papers to this repo,
you find any mistakes in this repo,
you have any suggestions for this repo.

shuxjweb/Awesome-Prompting-on-Vision-Language-Model