A curated list of awesome vision and language resources (still under construction... stay tuned!)

Awesome Vision-and-Language

A curated list of awesome vision and language resources, inspired by awesome-computer-vision.

Survey

Title Conference / Journal Paper Code Remarks
A Survey of Current Datasets for Vision and Language Research 2015 EMNLP 1506.06833
Multimodal Machine Learning: A Survey and Taxonomy 1705.09406
A Comprehensive Survey of Deep Learning for Image Captioning 1810.04020
Trends in Integration of Vision and Language Research: A Survey of Tasks, Datasets, and Methods 1907.09358
A Survey of Scene Graph Generation and Application Scene-Graph-Survey
Challenges and Prospects in Vision and Language Research 1904.09317
Deep Multimodal Representation Learning: A Survey 2019 ACCESS ACCESS 2019
Multimodal Intelligence: Representation Learning, Information Fusion, and Applications 1911.03977
Vision and Language: from Visual Perception to Content Creation 2020 APSIPA 1912.11872
Multimodal Research in Vision and Language: A Review of Current and Emerging Trends 2010.09522

Dataset

Title Conference / Journal Paper Code Remarks
VQA: Visual Question Answering 2015 ICCV 1505.00468 visualqa
Visual Storytelling 2016 NAACL 1604.03968 ai-visual-storytelling-seq2seq VIST
Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations 2017 IJCV 1602.07332 visual_genome_python_driver visualgenome
CLEVR: A Diagnostic Dataset for Compositional Language and Elementary Visual Reasoning 2017 CVPR 1612.06890
AVA: A Video Dataset of Spatio-temporally Localized Atomic Visual Actions 2018 CVPR 1705.08421 AVA
Embodied Question Answering 2018 CVPR 1711.11543 embodiedqa
Vision-and-Language Navigation: Interpreting visually-grounded navigation instructions in real environments 2018 CVPR 1711.07280 bringmeaspoon
GQA: A New Dataset for Real-World Visual Reasoning and Compositional Question Answering 2019 CVPR 1902.09506 visualreasoning
From Recognition to Cognition: Visual Commonsense Reasoning 2019 CVPR 1811.10830 r2c VCR
VATEX: A Large-Scale, High-Quality Multilingual Dataset for Video-and-Language Research 2019 ICCV 1904.03493
Bongard-LOGO: A New Benchmark for Human-Level Concept Learning and Reasoning 2020 NeurIPS 2010.00763 Bongard-LOGO
Bongard-HOI: Benchmarking Few-Shot Visual Reasoning for Human-Object Interactions 2022 CVPR 2205.13803 Bongard-HOI

Image Captioning

Title Conference / Journal Paper Code Remarks
Long-term Recurrent Convolutional Networks for Visual Recognition and Description 2015 CVPR 1411.4389
Deep Visual-Semantic Alignments for Generating Image Descriptions 2015 CVPR 1412.2306
Show and Tell: A Neural Image Caption Generator 2015 CVPR 1411.4555 show_and_tell.tensorflow
Show, Attend and Tell: Neural Image Caption Generation with Visual Attention 2015 ICML 1502.03044 show-attend-and-tell
From Captions to Visual Concepts and Back 2015 CVPR 1411.4952 visual-concepts
Image Captioning with Semantic Attention 2016 CVPR 1603.03925 semantic-attention
Knowing When to Look: Adaptive Attention via A Visual Sentinel for Image Captioning 2017 CVPR 1612.01887 AdaptiveAttention
Self-critical Sequence Training for Image Captioning 2017 CVPR 1612.00563
A Hierarchical Approach for Generating Descriptive Image Paragraphs 2017 CVPR 1611.06607
Deep reinforcement learning-based image captioning with embedding reward 2017 CVPR 1704.03899
Semantic compositional networks for visual captioning 2017 CVPR 1611.08002 Semantic_Compositional_Nets
StyleNet: Generating Attractive Visual Captions with Styles 2017 CVPR CVPR 2017 stylenet
Training for Diversity in Image Paragraph Captioning 2018 EMNLP EMNLP 2018 image-paragraph-captioning
Neural Baby Talk 2018 CVPR 1803.09845 NeuralBabyTalk
Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering 2018 CVPR 1707.07998
“Factual” or “Emotional”: Stylized Image Captioning with Adaptive Learning and Attention 2018 ECCV 1807.03871
Hierarchically Structured Reinforcement Learning for Topically Coherent Visual Story Generation 2019 AAAI 1805.08191
Unsupervised Image Captioning 2019 CVPR 1811.10787 unsupervised_captioning
Context-aware visual policy network for fine-grained image captioning 2019 TPAMI 1906.02365 CAVP
Dense Relational Captioning: Triple-Stream Networks for Relationship-Based Captioning 2019 CVPR 1903.05942
Describing like Humans: on Diversity in Image Captioning 2019 CVPR 1903.12020
Good News, Everyone! Context driven entity-aware captioning for news images 2019 CVPR 1904.01475
Auto-Encoding Scene Graphs for Image Captioning 2019 CVPR 1812.02378 SGAE
MSCap: Multi-Style Image Captioning with Unpaired Stylized Text 2019 CVPR CVPR 2019
Robust Change Captioning 2019 ICCV 1901.02527
Attention on Attention for Image Captioning 2019 ICCV 1908.06954
Context-Aware Group Captioning via Self-Attention and Contrastive Features 2020 CVPR 2004.03708
Say As You Wish: Fine-grained Control of Image Caption Generation with Abstract Scene Graphs 2020 CVPR 2003.00387 asg2cap
Comprehensive Image Captioning via Scene Graph Decomposition 2020 ECCV 2007.11731 Sub-GC
Are scene graphs good enough to improve Image Captioning? 2020 AACL 2009.12313
SG2Caps: Revisiting Scene Graphs for Image Captioning 2021 arxiv 2102.04990

Image Retrieval

Title Conference / Journal Paper Code Remarks
Visual Word2Vec (vis-w2v): Learning Visually Grounded Word Embeddings Using Abstract Scenes 2016 CVPR 1511.07067 VisualWord2Vec
Composing Text and Image for Image Retrieval - An Empirical Odyssey 2019 CVPR 1812.07119 tirg
Learning Relation Alignment for Calibrated Cross-modal Retrieval 2021 ACL 2105.13868 IAIS
ImageCoDe: Image Retrieval from Contextual Descriptions 2022 ACL 2203.15867 ImageCoDe
Assessing Brittleness of Image-Text Retrieval Benchmarks from Vision-Language Models Perspective 2407.15239

Scene Text Recognition

Title Conference / Journal Paper Code Remarks
Towards Unconstrained End-to-End Text Spotting 2019 ICCV 1908.09231
What Is Wrong With Scene Text Recognition Model Comparisons? Dataset and Model Analysis 2019 ICCV 1904.01906 clovaai

Scene Graph

Title Conference / Journal Paper Code Remarks
Image Retrieval Using Scene Graphs 2015 CVPR 7298990
Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations 2017 IJCV 1602.07332 visual_genome_python_driver visualgenome
Scene Graph Generation by Iterative Message Passing 2017 CVPR 1701.02426 scene-graph-TF-release
Scene Graph Generation from Objects, Phrases and Region Captions 2017 ICCV 1707.09700 MSDN
Neural Motifs: Scene Graph Parsing with Global Context 2018 CVPR 1711.06640 neural-motifs
Generating Triples with Adversarial Networks for Scene Graph Construction 2018 AAAI 1802.02598
LinkNet: Relational Embedding for Scene Graph 2018 NeurIPS 1811.06410
Image Generation from Scene Graphs 2018 CVPR 1804.01622 sg2im
Graph R-CNN for Scene Graph Generation 2018 ECCV 1808.00191 graph-rcnn.pytorch
Scene Graph Generation with External Knowledge and Image Reconstruction 2019 CVPR 1904.00560
Specifying Object Attributes and Relations in Interactive Scene Generation 2019 ICCV 1909.05379 scene_generation
Attentive Relational Networks for Mapping Images to Scene Graphs 2019 CVPR 1811.10696
Exploring Context and Visual Pattern of Relationship for Scene Graph Generation 2019 CVPR sceneGraph_Mem
Graphical Contrastive Losses for Scene Graph Parsing 2019 CVPR 1903.02728 ContrastiveLosses4VRD
Knowledge-Embedded Routing Network for Scene Graph Generation 2019 CVPR 1903.03326 KERN
Learning to Compose Dynamic Tree Structures for Visual Contexts 2019 CVPR 1812.01880 VCTree
Counterfactual Critic Multi-Agent Training for Scene Graph Generation 2019 ICCV 1812.02347
Scene Graph Prediction with Limited Labels 2019 ICCV 1904.11622 limited-label
Unbiased Scene Graph Generation from Biased Training 2020 CVPR 2002.11949 Scene-Graph-Benchmark
GPS-Net: Graph Property Sensing Network for Scene Graph Generation 2020 CVPR 2003.12962 GPS-Net
Learning Visual Commonsense for Robust Scene Graph Generation 2020 ECCV 2006.09623
Sketching Image Gist: Human-Mimetic Hierarchical Scene Graph Generation 2020 ECCV 2007.08760 het-eccv20

text2image

Title Conference / Journal Paper Code Remarks
Generative Adversarial Text to Image Synthesis 2016 ICML 1605.05396 icml2016
StackGAN: Text to Photo-realistic Image Synthesis with Stacked Generative Adversarial Networks 2017 ICCV 1612.03242 StackGAN
AttnGAN: Fine-Grained Text to Image Generation with Attentional Generative Adversarial Networks 2018 CVPR 1711.10485 AttnGAN
Photographic Text-to-Image Synthesis with a Hierarchically-nested Adversarial Network 2018 CVPR 1802.09178 HDGan
StoryGAN: A Sequential Conditional GAN for Story Visualization 2019 CVPR 1812.02784 StoryGAN
MirrorGAN: Learning Text-to-image Generation by Redescription 2019 CVPR 1903.05854
DM-GAN: Dynamic Memory Generative Adversarial Networks for Text-to-Image Synthesis 2019 CVPR 1904.01310
Semantics Disentangling for Text-to-Image Generation 2019 CVPR 1904.01480
Tell, Draw, and Repeat: Generating and Modifying Images Based on Continual Linguistic Instruction 2019 ICCV 1811.09845 GeNeVA
Specifying Object Attributes and Relations in Interactive Scene Generation 2019 ICCV 1909.05379 scene_generation

Video Captioning

Title Conference / Journal Paper Code Remarks
Long-term Recurrent Convolutional Networks for Visual Recognition and Description 2015 CVPR 1411.4389
Video Paragraph Captioning Using Hierarchical Recurrent Neural Networks 2016 CVPR 1510.07712
Attention-Based Multimodal Fusion for Video Description 2017 CVPR 1701.03126
Semantic compositional networks for visual captioning 2017 CVPR 1611.08002
Task-Driven Dynamic Fusion: Reducing Ambiguity in Video Description 2017 CVPR CVPR_2017
Bidirectional Attentive Fusion with Context Gating for Dense Video Captioning 2018 CVPR 1804.00100
Adversarial Inference for Multi-Sentence Video Description 2019 CVPR 1812.05634 adv-inf
Streamlined Dense Video Captioning 2019 CVPR 1904.03870 DenseVideoCaptioning
Object-aware Aggregation with Bidirectional Temporal Graph for Video Captioning 2019 CVPR 1906.04375
iPerceive: Applying Common-Sense Reasoning to Multi-Modal Dense Video Captioning and Video Question Answering 2021 WACV 2011.07735 iPerceive

Video Question Answering

Title Conference / Journal Paper Code Remarks
MovieQA: Understanding Stories in Movies through Question-Answering 2016 CVPR 1512.02902 MovieQA
TVQA: Localized, Compositional Video Question Answering 2018 EMNLP 1809.01696 TVQA
Knowledge-Based Video Question Answering with Unsupervised Scene Descriptions 2020 ECCV 2007.08751 ROLL-VideoQA
iPerceive: Applying Common-Sense Reasoning to Multi-Modal Dense Video Captioning and Video Question Answering 2021 WACV 2011.07735 iPerceive

Video Understanding

Title Conference / Journal Paper Code Remarks
TSM: Temporal Shift Module for Efficient Video Understanding 2019 ICCV 1811.08383 temporal-shift-module
A Graph-Based Framework to Bridge Movies and Synopses 2019 ICCV 1910.11009

Vision and Language Navigation

Title Conference / Journal Paper Code Remarks
Embodied Question Answering 2018 CVPR 1711.11543 embodiedqa
Vision-and-Language Navigation: Interpreting visually-grounded navigation instructions in real environments 2018 CVPR 1711.07280 bringmeaspoon
Frequency-Enhanced Data Augmentation for Vision-and-Language Navigation 2023 NeurIPS fda_pdf fda_code
Memory-adaptive vision-and-language navigation 2024 PR mam_paper

Vision-and-Language Pretraining

Title Conference / Journal Paper Code Remarks
LXMERT: Learning Cross-Modality Encoder Representations from Transformers 2019 EMNLP 1908.07490 lxmert
VideoBERT: A Joint Model for Video and Language Representation Learning 2019 ICCV 1904.01766
ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks 2019 NeurIPS vilbert
OmniNet: A unified architecture for multi-modal multi-task learning 2019 arxiv 1907.07804 OmniNet
Unicoder-VL: A Universal Encoder for Vision and Language by Cross-modal Pre-training 2020 AAAI 1908.06066 Unicoder
Unified Vision-Language Pre-Training for Image Captioning and VQA 2020 AAAI 1909.11059 VLP
Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks 2020 ECCV 2004.06165 Oscar
Unsupervised Learning of Visual Features by Contrasting Cluster Assignments 2020 NeurIPS 2006.09882 swav
Learning to Learn Words from Visual Scenes 2020 ECCV 1911.11237
ERNIE-ViL: Knowledge Enhanced Vision-Language Representations through Scene Graphs 2021 AAAI 2006.16934 ERNIE
VinVL: Revisiting Visual Representations in Vision-Language Models 2021 CVPR 2101.00529 VinVL
VirTex: Learning Visual Representations from Textual Annotations 2021 CVPR 2006.06666 virtex
Learning Transferable Visual Models From Natural Language Supervision 2021 arxiv 2103.00020
Pretrained Transformers As Universal Computation Engines 2021 arxiv 2103.05247 universal-computation
Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision 2021 arxiv 2102.05918
Self-supervised Pretraining of Visual Features in the Wild 2021 arxiv 2103.01988
Transformer is All You Need: Multimodal Multitask Learning with a Unified Transformer 2021 arxiv 2102.10772
Zero-Shot Text-to-Image Generation 2021 arxiv 2102.12092
WenLan: Bridging Vision and Language by Large-Scale Multi-Modal Pre-Training 2021 arxiv 2103.06561
Improved baselines for vision-language pre-training 2023 arxiv 2305.08675

Visual Dialog

Title Conference / Journal Paper Code Remarks
Visual Dialog 2017 CVPR 1611.08669 visdial visualdialog
Two Can Play This Game: Visual Dialog With Discriminative Question Generation and Answering 2018 CVPR 1803.11186
Accountable Textual-Visual Chat Learns to Reject Human Instructions in Image Re-creation 2023 2303.05983 ATVC

Visual Grounding

Title Conference / Journal Paper Code Remarks
Modeling Relationships in Referential Expressions with Compositional Modular Networks 2017 CVPR 1611.09978 cmn
Phrase Localization Without Paired Training Examples 2019 ICCV 1908.07553
Learning to Assemble Neural Module Tree Networks for Visual Grounding 2019 ICCV 1812.03299
A Fast and Accurate One-Stage Approach to Visual Grounding 2019 ICCV 1908.06354
Zero-Shot Grounding of Objects from Natural Language Queries 2019 ICCV 1908.07129 zsgnet
Collaborative Transformers for Grounded Situation Recognition 2022 CVPR 2203.16518 CoFormer

Visual Question Answering

Title Conference / Journal Paper Code Remarks
VQA: Visual Question Answering 2015 ICCV 1505.00468 visualqa
Hierarchical question-image co-attention for visual question answering 2016 NIPS 1606.00061 HieCoAttenVQA
Multimodal Compact Bilinear Pooling for Visual Question Answering and Visual Grounding 2016 EMNLP 1606.01847 vqa-mcb
Stacked Attention Networks for Image Question Answering 2016 CVPR 1511.02274 imageqa-san
Ask, Attend and Answer: Exploring Question-Guided Spatial Attention for Visual Question Answering 2016 ECCV 1511.05234 AAAA
Dynamic Memory Networks for Visual and Textual Question Answering 2016 ICML 1603.01417 dmn-plus
Multimodal Residual Learning for Visual QA 2016 NIPS 1606.01455 nips-mrn-vqa
Graph-Structured Representations for Visual Question Answering 2017 CVPR 1609.05600
Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering 2017 CVPR 1612.00837
Learning to Reason: End-to-End Module Networks for Visual Question Answering 2017 ICCV 1704.05526
Explicit Reasoning over End-to-End Neural Architectures for Visual Question Answering 2018 AAAI 1803.08896 PSLQA
Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering 2018 CVPR 1707.07998
Tips and Tricks for Visual Question Answering: Learnings from the 2017 Challenge 2018 CVPR 1708.02711 vqa-winner
Transfer Learning via Unsupervised Task Discovery for Visual Question Answering 2019 CVPR 1810.02358 VQA-Transfer-ExternalData
GQA: A New Dataset for Real-World Visual Reasoning and Compositional Question Answering 2019 CVPR 1902.09506 visualreasoning
Towards VQA Models That Can Read 2019 CVPR 1904.08920
From Strings to Things: Knowledge-enabled VQA Model that can Read and Reason 2019 ICCV ICCV2019
An Empirical Study on Leveraging Scene Graphs for Visual Question Answering 2019 BMVC 1907.12133 scene-graphs-vqa
RelViT: Concept-guided Vision Transformer for Visual Relational Reasoning 2022 ICLR 2204.11167 RelViT
TAG: Boosting Text-VQA via Text-aware Visual Question-answer Generation 2022 arXiv 2208.01813 TAG

Visual Reasoning

Title Conference / Journal Paper Code Remarks
CLEVR: A Diagnostic Dataset for Compositional Language and Elementary Visual Reasoning 2017 CVPR 1612.06890
Inferring and Executing Programs for Visual Reasoning 2017 ICCV 1705.03633
GQA: A New Dataset for Real-World Visual Reasoning and Compositional Question Answering 2019 CVPR 1902.09506 visualreasoning
Explainable and Explicit Visual Reasoning over Scene Graphs 2019 CVPR 1812.01855
From Recognition to Cognition: Visual Commonsense Reasoning 2019 CVPR 1811.10830 r2c VCR
Dynamic Graph Attention for Referring Expression Comprehension 2019 ICCV 1909.08164
Visual Semantic Reasoning for Image-Text Matching 2019 ICCV 1909.02701 VSRN
Bongard-LOGO: A New Benchmark for Human-Level Concept Learning and Reasoning 2020 NeurIPS 2010.00763 Bongard-LOGO
Bongard-HOI: Benchmarking Few-Shot Visual Reasoning for Human-Object Interactions 2022 CVPR 2205.13803 Bongard-HOI
RelViT: Concept-guided Vision Transformer for Visual Relational Reasoning 2022 ICLR 2204.11167 RelViT
PromptStyler: Prompt-driven Style Generation for Source-free Domain Generalization 2023 ICCV 2307.15199 PromptStyler

Visual Relationship Detection

Title Conference / Journal Paper Code Remarks
Visual Relationship Detection with Language Priors 2016 ECCV 1608.00187 Visual-Relationship-Detection
ViP-CNN: Visual Phrase Guided Convolutional Neural Network 2017 CVPR 1702.07191
Visual Translation Embedding Network for Visual Relation Detection 2017 CVPR 1702.08319 drnet
Deep Variation-structured Reinforcement Learning for Visual Relationship and Attribute Detection 2017 CVPR 1703.03054 DeepVariationRL
Detecting Visual Relationships with Deep Relational Networks 2017 CVPR 1704.03114 drnet
Phrase Localization and Visual Relationship Detection with Comprehensive Image-Language Cues 2017 ICCV 1611.06641 pl-clc
Visual Relationship Detection with Internal and External Linguistic Knowledge Distillation 2017 ICCV 1707.09423
Referring Relationships 2018 CVPR 1803.10362 ReferringRelationships
Zoom-Net: Mining Deep Feature Interactions for Visual Relationship Recognition 2018 ECCV 1807.04979 ZoomNet
Shuffle-Then-Assemble: Learning Object-Agnostic Visual Relationship Features 2018 ECCV 1808.00171 vrd
Leveraging Auxiliary Text for Deep Recognition of Unseen Visual Relationships 2020 ICLR 1910.12324

Visual Storytelling

Title Conference / Journal Paper Code Remarks
Visual Storytelling 2016 NAACL 1604.03968 ai-visual-storytelling-seq2seq VIST
No Metrics Are Perfect: Adversarial Reward Learning for Visual Storytelling 2018 ACL 1804.09160 AREL
Show, Reward and Tell: Automatic Generation of Narrative Paragraph from Photo Stream by Adversarial Training 2018 AAAI
Hide-and-Tell: Learning to Bridge Photo Streams for Visual Storytelling 2020 AAAI 2002.00774
Storytelling from an Image Stream Using Scene Graphs 2020 AAAI AAAI 2020

Contributing

Please feel free to send me pull requests or email (shmwoo9395@gmail.com) to add links.

License

CC0

To the extent possible under law, Sangmin Woo has waived all copyright and related or neighboring rights to this work.