Awesome Prompting Papers in Computer Vision

A curated list of prompt-based papers in computer vision and vision-language learning.

Keywords

  • Task tag
  • Abbreviation tag
  • Characteristic tag: a characteristic that makes the paper unique
  • Bold font: We highlight pioneering work that may have contributed to the prevalence of visual prompting.

Vision Prompt

This section collects papers prompting pretrained vision foundation models (e.g., ViT) for parameter-efficient adaptation.

  • Learning to Prompt for Continual Learning [paper] [code]

    CVPR 2022

  • Visual Prompt Tuning [paper] [code]

    ECCV 2022

  • DualPrompt: Complementary Prompting for Rehearsal-free Continual Learning [paper] [code]

    ECCV 2022

  • AdaptFormer: Adapting Vision Transformers for Scalable Visual Recognition [paper] [code]

    NeurIPS 2022

  • Scaling & Shifting Your Features: A New Baseline for Efficient Model Tuning [paper] [code]

    NeurIPS 2022

  • P2P: Tuning Pre-trained Image Models for Point Cloud Analysis with Point-to-Pixel Prompting [paper] [code]

    NeurIPS 2022

  • Generative Visual Prompt: Unifying Distributional Control of Pre-Trained Generative Models [paper] [code]

    NeurIPS 2022

  • Visual Prompting via Image Inpainting [paper] [code]

    NeurIPS 2022

  • Decorate the Newcomers: Visual Domain Prompt for Continual Test Time Adaptation [paper]

    AAAI 2023

  • LPT: Long-tailed Prompt Tuning for Image Classification [paper]

    ICLR 2023

  • Diversity-Aware Meta Visual Prompting [paper] [code]

    CVPR 2023

  • Semantic Prompt for Few-Shot Image Recognition [paper]

    CVPR 2023

  • Visual Prompt Tuning for Generative Transfer Learning [paper] [code]

    CVPR 2023

  • CORA: Adapting CLIP for Open-Vocabulary Detection with Region Prompting and Anchor Pre-Matching [paper] [code]

    CVPR 2023

  • Images Speak in Images: A Generalist Painter for In-Context Visual Learning [paper] [code]

    CVPR 2023

  • PIVOT: Prompting for Video Continual Learning [paper]

    CVPR 2023

  • Learning Expressive Prompting With Residuals for Vision Transformers [paper]

    CVPR 2023

  • BlackVIP: Black-Box Visual Prompting for Robust Transfer Learning [paper] [code]

    CVPR 2023

  • Visual Prompt Multi-Modal Tracking [paper] [code]

    CVPR 2023

  • A-La-Carte Prompt Tuning (APT): Combining Distinct Data Via Composable Prompting [paper]

    CVPR 2023

  • Understanding and Improving Visual Prompting: A Label-Mapping Perspective [paper] [code]

    CVPR 2023

  • Visual Query Tuning: Towards Effective Usage of Intermediate Representations for Parameter and Memory Efficient Transfer Learning [paper] [code]

    CVPR 2023

  • Explicit Visual Prompting for Low-Level Structure Segmentations [paper] [code]

    CVPR 2023

ArXiv Papers

  • Exploring Visual Prompts for Adapting Large-Scale Models [paper] [code]

    arXiv 2022/03

  • Vision Transformer Adapter for Dense Predictions [paper] [code]

    arXiv 2022/05

  • Neural Prompt Search [paper] [code]

    arXiv 2022/06

  • Convolutional Bypasses Are Better Vision Transformer Adapters [paper] [code]

    arXiv 2022/07

  • Conv-Adapter: Exploring Parameter Efficient Transfer Learning for ConvNets [paper]

    arXiv 2022/08

  • Prompt Vision Transformer for Domain Generalization [paper]

    arXiv 2022/08

  • Prompt-Matched Semantic Segmentation [paper]

    arXiv 2022/08

  • Visual Prompt Tuning for Test-time Domain Adaptation [paper]

    arXiv 2022/10

  • Visual Prompting for Adversarial Robustness [paper]

    arXiv 2022/10

  • Prompt Generation Networks for Efficient Adaptation of Frozen Vision Transformers [paper] [code]

    arXiv 2022/10

  • Towards a Unified View on Visual Parameter-Efficient Transfer Learning [paper] [code]

    arXiv 2022/10

  • Multitask Vision-Language Prompt Tuning [paper] [code]

    arXiv 2022/11

Vision-Language Prompt

This section collects papers prompting pretrained vision-language foundation models (e.g., CLIP) for parameter-efficient adaptation.

  • Learning Transferable Visual Models From Natural Language Supervision [paper] [code]

    ICML 2021

  • Learning to Prompt for Vision-Language Models [paper] [code]

    IJCV 2022

  • Prompt Distribution Learning [paper]

    CVPR 2022

  • Conditional Prompt Learning for Vision-Language Models [paper] [code]

    CVPR 2022

  • DenseCLIP: Language-Guided Dense Prediction with Context-Aware Prompting [paper] [code]

    CVPR 2022

  • Bridge-Prompt: Towards Ordinal Action Understanding in Instructional Videos [paper] [code]

    CVPR 2022

  • PointCLIP: Point Cloud Understanding by CLIP [paper] [code]

    CVPR 2022

  • VL-Adapter: Parameter-Efficient Transfer Learning for Vision-and-Language Tasks [paper] [code]

    CVPR 2022

  • A Good Prompt Is Worth Millions of Parameters? Low-resource Prompt-based Learning for Vision-Language Models [paper]

    ACL 2022

  • Can Language Understand Depth? [paper] [code]

    ACM MM 2022

  • Expanding Language-Image Pretrained Models for General Video Recognition [paper] [code]

    ECCV 2022

  • Tip-Adapter: Training-free Adaption of CLIP for Few-shot Classification [paper] [code]

    ECCV 2022

  • OrdinalCLIP: Learning Rank Prompts for Language-Guided Ordinal Regression [paper]

    NeurIPS 2022

  • Test-Time Prompt Tuning for Zero-Shot Generalization in Vision-Language Models [paper] [code]

    NeurIPS 2022

  • Learning to Decompose Visual Features with Latent Textual Prompts [paper]

    ICLR 2023

  • PLOT: Prompt Learning with Optimal Transport for Vision-Language Models [paper] [code]

    ICLR 2023

  • Visual-Language Prompt Tuning with Knowledge-guided Context Optimization [paper] [code]

    CVPR 2023

  • Open-Set Fine-Grained Retrieval Via Prompting Vision-Language Evaluator [paper]

    CVPR 2023

  • Multimodal Prompting With Missing Modalities for Visual Recognition [paper] [code]

    CVPR 2023

  • Efficient Multimodal Fusion Via Interactive Prompting [paper]

    CVPR 2023

  • Hierarchical Prompt Learning for Multi-Task Learning [paper] [code]

    CVPR 2023

  • Text-Visual Prompting for Efficient 2D Temporal Video Grounding [paper]

    CVPR 2023

  • VoP: Text-Video Co-operative Prompt Tuning for Cross-Modal Retrieval [paper] [code]

    CVPR 2023

  • MaPLe: Multi-modal Prompt Learning [paper] [code]

    CVPR 2023

  • Texts as Images in Prompt Tuning for Multi-Label Image Recognition [paper] [code]

    CVPR 2023

  • Vita-CLIP: Video and Text Adaptive CLIP Via Multimodal Prompting [paper] [code]

    CVPR 2023

  • LASP: Text-to-Text Optimization for Language-Aware Soft Prompting of Vision & Language Models [paper] [code]

    CVPR 2023

  • $\pi$-Tuning: Transferring Multimodal Foundation Models with Optimal Multi-task Interpolation [paper] [code]

    ICML 2023

  • POUF: Prompt-oriented unsupervised fine-tuning for large pre-trained models [paper] [code]

    ICML 2023

  • Rethinking the Openness of CLIP [paper] [code]

    ACL 2023

ArXiv Papers

  • Colorful Prompt Tuning for Pre-trained Vision-Language Models [paper]

    arXiv 2021/08

  • ActionCLIP: A New Paradigm for Video Action Recognition [paper] [code]

    arXiv 2021/09

  • CLIP-Adapter: Better Vision-Language Models with Feature Adapters [paper] [code]

    arXiv 2021/10

  • Amortized Prompt: Lightweight Fine-Tuning for CLIP in Domain Generalization [paper]

    arXiv 2021/11

  • Prompting Visual-Language Models for Efficient Video Understanding [paper] [code]

    arXiv 2021/12

  • Unsupervised Prompt Learning for Vision-Language Models [paper] [code]

    arXiv 2022/04

  • Prompt-aligned Gradient for Prompt Tuning [paper] [code]

    arXiv 2022/05

  • Parameter-Efficient Image-to-Video Transfer Learning [paper]

    arXiv 2022/06

  • DualCoOp: Fast Adaptation to Multi-Label Recognition with Limited Annotations [paper]

    arXiv 2022/06

  • Prompt Tuning for Generative Multimodal Pretrained Models [paper] [code]

    arXiv 2022/06

  • Prompt Tuning with Soft Context Sharing for Vision-Language Models [paper]

    arXiv 2022/08

  • CPL: Counterfactual Prompt Learning for Vision and Language Models [paper] [code]

    arXiv 2022/10

  • Understanding and Mitigating Overfitting in Prompt Tuning for Vision-Language Models [paper] [code]

    arXiv 2022/10

  • Unified Vision and Language Prompt Learning [paper]

    arXiv 2022/10

  • Multi-Prompt Alignment for Multi-source Unsupervised Domain Adaptation [paper]

    arXiv 2022/10

  • Prompt Pre-Training with Twenty-Thousand Classes for Open-Vocabulary Visual Recognition [paper] [code]

    arXiv 2023/04

Language-Interactable Prompt

A language-interactable prompter develops zero/few-shot capabilities by prompting several independent foundation models (VLMs, LLMs, VMs, etc.) through a language interface. One of the most attractive applications is the multimodal chatbot.

  • Multimodal Few-Shot Learning with Frozen Language Models [paper]

    NeurIPS 2021

  • An Empirical Study of GPT-3 for Few-Shot Knowledge-Based VQA [paper] [code]

    AAAI 2022

  • VisualGPT: Data-efficient Adaptation of Pretrained Language Models for Image Captioning [paper] [code]

    CVPR 2022

  • Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language [paper] [code]

    ICLR 2023

ArXiv Papers

  • Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models [paper] [code] [demo]

    arXiv 2023/03

  • Chameleon: Plug-and-play compositional reasoning with large language models [paper] [code]

    arXiv 2023/04

  • ClipCap: CLIP Prefix for Image Captioning [paper] [code]

    arXiv 2021/11

  • Flamingo: a Visual Language Model for Few-Shot Learning [paper]

    arXiv 2022/04

  • Language Models Can See: Plugging Visual Controls in Text Generation [paper] [code]

    arXiv 2022/05

  • Zero-Shot Video Question Answering via Frozen Bidirectional Language Models [paper]

    arXiv 2022/06

  • Visual Clues: Bridging Vision and Language Foundations for Image Paragraph Captioning [paper]

    arXiv 2022/06

Vision-Language Instruction Tuning

The goal of vision-language instruction tuning is to train a model that can effectively understand instructions for general-purpose multimodal tasks.

  • Visual Instruction Tuning [paper] [code] [demo]

    arXiv 2023/04

  • MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models [paper] [code] [demo]

    arXiv 2023/04

  • Otter: A Multi-Modal Model with In-Context Instruction Tuning [paper] [code] [demo]

    arXiv 2023/05

  • MultiModal-GPT: A Vision and Language Model for Dialogue with Humans [paper] [code]

    arXiv 2023/05

  • InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning [paper] [code]

    arXiv 2023/05

More Resources

  • PromptPapers: A comprehensive curated list of prompting papers (mainly in natural language processing).
  • Awesome Multimodal Assistant: A curated list of work on vision-language instruction tuning and LLM-based chatbots.