Recent Advances in Vision and Language Pre-training (VLP)

Apache License 2.0

Maintained by Feilong Chen. Last updated on 2023/03/04.

Table of Contents

Survey

  1. VLP: A Survey on Vision-Language Pre-training, arXiv 2022

Image-based VLP

Representation Learning

  1. Learning Transferable Visual Models From Natural Language Supervision, CLIP, ICML 2021, [code]

  2. ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks, NeurIPS 2019, [code]

  3. LXMERT: Learning Cross-Modality Encoder Representations from Transformers, EMNLP 2019, [code]

  4. VL-BERT: Pre-training of Generic Visual-Linguistic Representations, ICLR 2020, [code]

  5. VisualBERT: A Simple and Performant Baseline for Vision and Language, arXiv 2019/08, ACL 2020, [code]

  6. Unicoder-VL: A Universal Encoder for Vision and Language by Cross-modal Pre-training, AAAI 2020

  7. Unified Vision-Language Pre-Training for Image Captioning and VQA, AAAI 2020, [code], (VLP)

  8. UNITER: Learning Universal Image-text Representations, ECCV 2020, [code]

  9. Weak Supervision helps Emergence of Word-Object Alignment and improves Vision-Language Tasks, arXiv 2019/12

  10. InterBERT: Vision-and-Language Interaction for Multi-modal Pretraining, arXiv 2020/03

  11. Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks, ECCV 2020, [code]

  12. Pixel-BERT: Aligning Image Pixels with Text by Deep Multi-Modal Transformers, arXiv 2020/04

  13. ERNIE-ViL: Knowledge Enhanced Vision-Language Representations Through Scene Graph, arXiv 2020/06

  14. DeVLBert: Learning Deconfounded Visio-Linguistic Representations, ACM MM 2020, [code]

  15. X-LXMERT: Paint, Caption and Answer Questions with Multi-Modal Transformers, EMNLP 2020

  16. SemVLP: Vision-Language Pre-training by Aligning Semantics at Multiple Levels, ICLR 2021 submission

  17. CAPT: Contrastive Pre-Training for Learning Denoised Sequence Representations, arXiv 2020/10

  18. Multimodal Pretraining Unmasked: Unifying the Vision and Language BERTs, arXiv 2020/11

  19. LAMP: Label Augmented Multimodal Pretraining, arXiv 2020/12

  20. Scheduled Sampling in Vision-Language Pretraining with Decoupled Encoder-Decoder Network, AAAI 2021

  21. VinVL: Revisiting Visual Representations in Vision-Language Models, CVPR 2021, [code]

  22. ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision, ICML 2021, [code]

  23. OPT: Omni-Perception Pre-Trainer for Cross-Modal Understanding and Generation, arXiv 2021

  24. UNIMO: Towards Unified-Modal Understanding and Generation via Cross-Modal Contrastive Learning, ACL 2021, [code]

  25. How Much Can CLIP Benefit Vision-and-Language Tasks?, arXiv 2021, [code]

  26. Unifying Vision-and-Language Tasks via Text Generation, ICML 2021, [code]

  27. Multimodal Pretraining Unmasked: A Meta-Analysis and a Unified Framework of Vision-and-Language BERTs, ACL 2021, [code]

  28. SimVLM: Simple Visual Language Model Pretraining with Weak Supervision, arXiv 2021

  29. VLMo: Unified Vision-Language Pre-Training with Mixture-of-Modality-Experts, arXiv 2021, [code]

  30. Kaleido-BERT: Vision-Language Pre-training on Fashion Domain, CVPR 2021, [code]

  31. Multi-Grained Vision Language Pre-Training: Aligning Texts with Visual Concepts, ICML 2022, [code]

  32. Vision-Language Pre-Training with Triple Contrastive Learning, CVPR 2022, [code]

  33. Unpaired Vision-Language Pre-training via Cross-Modal CutMix, ICML 2022

  34. BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation, ICML 2022, [code]

  35. OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework, ICML 2022, [code]

  36. GIT: A Generative Image-to-text Transformer for Vision and Language, arXiv 2022, [code]

  37. CoCa: Contrastive Captioners are Image-Text Foundation Models, arXiv 2022, [code]

  38. Image as a Foreign Language: BEiT Pretraining for All Vision and Vision-Language Tasks, arXiv 2022, [code]

  39. PaLI: A Jointly-Scaled Multilingual Language-Image Model, arXiv 2022

  40. BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models, arXiv 2023

  41. Language Is Not All You Need: Aligning Perception with Language Models, arXiv 2023, [code]

  42. Unifying Vision-Language Representation Space with Single-tower Transformer, AAAI 2023

Task-specific

Image Caption

  1. Image captioning: XGPT: Cross-modal Generative Pre-Training for Image Captioning, arXiv 2020/03

VQA

  1. VQA: Fusion of Detected Objects in Text for Visual Question Answering, EMNLP 2019, [code], (B2T2)

  2. TextVQA: Iterative Answer Prediction with Pointer-Augmented Multimodal Transformers for TextVQA, CVPR 2020, [code], (M4C)

  3. Chart VQA: STL-CQA: Structure-based Transformers with Localization and Encoding for Chart Question Answering, EMNLP 2020

  4. Visual Question Generation: BERT Can See Out of the Box: On the Cross-modal Transferability of Text Representations, arXiv 2020/02

  5. TextVQA: TAG: Boosting Text-VQA via Text-aware Visual Question-answer Generation, arXiv 2022, [code], (TAG)

Visual Dialog

  1. VisDial: VD-BERT: A Unified Vision and Dialog Transformer with BERT, EMNLP 2020, [code], (VD-BERT)

  2. VisDial: Large-scale Pretraining for Visual Dialog: A Simple State-of-the-Art Baseline, ECCV 2020, [code], (VisDial-BERT)

  3. VisDial: UTC: A Unified Transformer with Inter-Task Contrastive Learning for Visual Dialog, CVPR 2022

Text-Image Retrieval

  1. Text-image retrieval: ImageBERT: Cross-Modal Pre-training with Large-scale Weak-supervised Image-text Data, arXiv 2020/01

  2. Text-image retrieval: Cross-Probe BERT for Efficient and Effective Cross-Modal Search, ICLR 2021 submission

  3. Text-image retrieval: Learning Relation Alignment for Calibrated Cross-modal Retrieval, ACL 2021

  4. Text-image retrieval: Dynamic Contrastive Distillation for Image-Text Retrieval, arXiv 2022/07

  5. Text-image retrieval: Where Does the Performance Improvement Come From? - A Reproducibility Concern about Image-Text Retrieval, SIGIR 2022

Visual Language Navigation

  1. VLN: Towards Learning a Generic Agent for Vision-and-Language Navigation via Pre-training, CVPR 2020, [code], (PREVALENT)

Visual Machine Reading Comprehension

  1. VisualMRC: VisualMRC: Machine Reading Comprehension on Document Images, AAAI 2021, (LayoutT5, LayoutBART)

Other Tasks

  1. Visual Relationship Detection: Visual Relationship Detection With Visual-Linguistic Knowledge From Multimodal Representations, IEEE Access 2021

Other Analysis

  1. Multi-task Learning, 12-in-1: Multi-Task Vision and Language Representation Learning, CVPR 2020, [code]

  2. Multi-task Learning, Unifying Vision-and-Language Tasks via Text Generation, arXiv 2021/02

  3. Social Bias in VL Embedding, Measuring Social Biases in Grounded Vision and Language Embeddings, arXiv 2020/02, [code]

  4. In-depth Analysis, Are we pretraining it right? Digging deeper into visio-linguistic pretraining

  5. In-depth Analysis, Behind the Scene: Revealing the Secrets of Pre-trained Vision-and-Language Models, ECCV 2020 Spotlight

  6. In-depth Analysis, A Closer Look at the Robustness of Vision-and-Language Pre-trained Models, arXiv 2020/12

  7. Adversarial Training, Large-Scale Adversarial Training for Vision-and-Language Representation Learning, NeurIPS 2020 Spotlight

  8. Adaptive Analysis, Adaptive Transformers for Learning Multimodal Representations, ACL SRW 2020

  9. Neural Architecture Search, Deep Multimodal Neural Architecture Search, arXiv 2020/04

  10. Dataset perspective, Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision, arXiv 2021/02

Video-based VLP

  1. VideoBERT: A Joint Model for Video and Language Representation Learning, ICCV 2019

  2. Learning Video Representations Using Contrastive Bidirectional Transformers, arXiv 2019/06, (CBT)

  3. M-BERT: Injecting Multimodal Information in the BERT Structure, arXiv 2019/08

  4. BERT for Large-scale Video Segment Classification with Test-time Augmentation, ICCV 2019 YouTube8M workshop, [code]

  5. Bridging Text and Video: A Universal Multimodal Transformer for Video-Audio Scene-Aware Dialog, AAAI 2020 DSTC8 workshop

  6. Learning Spatiotemporal Features via Video and Text Pair Discrimination, arXiv 2020/01, (CPD), [code]

  7. UniVL: A Unified Video and Language Pre-Training Model for Multimodal Understanding and Generation, arXiv 2020/02

  8. ActBERT: Learning Global-Local Video-Text Representations, CVPR 2020

  9. HERO: Hierarchical Encoder for Video+Language Omni-representation Pre-training, EMNLP 2020

  10. Video-Grounded Dialogues with Pretrained Generation Language Models, ACL 2020

  11. Auto-captions on GIF: A Large-scale Video-sentence Dataset for Vision-language Pre-training, arXiv 2020/07

  12. Multimodal Pretraining for Dense Video Captioning, arXiv 2020/11

  13. Parameter Efficient Multimodal Transformers for Video Representation Learning, arXiv 2020/12

  14. Less is More: CLIPBERT for Video-and-Language Learning via Sparse Sampling, CVPR 2021

Other Transformer-based multimodal networks

  1. Multi-Modality Cross Attention Network for Image and Sentence Matching, CVPR 2020

  2. MART: Memory-Augmented Recurrent Transformer for Coherent Video Paragraph Captioning, ACL 2020

  3. History for Visual Dialog: Do we really need it?, ACL 2020

  4. Cross-Modality Relevance for Reasoning on Language and Vision, ACL 2020

Other Resources