A curated list of Visual Question Answering (VQA) (including image and video question answering), Visual Question Generation, Visual Dialog, Visual Commonsense Reasoning, and related areas.
Please feel free to send me pull requests or email (leungjokie@gmail.com) to add links. Markdown format:
- [Paper Name](link) - Author 1 et al, **Conference Year**. [[code]](link)
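For example, a completed entry for a paper already in this list might look like the line below (both links are placeholders to be filled in):

- [Bilinear Attention Networks](link) - Jin-Hwa Kim et al, **NIPS 2018**. [[code]](link)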
- Mar. 3rd, 2019: The first version released.
- Contributing
- Change Log
- Table of Contents
- Papers
- VQA Challenge Leaderboard
- Licenses
- Reference and Acknowledgement
- Visual question answering: Datasets, algorithms, and future challenges - Kushal Kafle et al, CVIU 2017.
- Visual question answering: A survey of methods and datasets - Qi Wu et al, CVIU 2017.
- Check It Again: Progressive Visual Question Answering via Visual Entailment - Qingyi Si et al, ACL 2021. [code]
- Mind Your Outliers! Investigating the Negative Impact of Outliers on Active Learning for Visual Question Answering - Siddharth Karamcheti et al, ACL 2021. [code]
- In Factuality: Efficient Integration of Relevant Facts for Visual Question Answering - Peter Vickers et al, ACL 2021.
- Towards Visual Question Answering on Pathology Images - Xuehai He et al, ACL 2021. [code]
- Are VQA Systems RAD? Measuring Robustness to Augmented Data with Focused Interventions - Daniel Rosenberg et al, ACL 2021. [code]
- LPF: A Language-Prior Feedback Objective Function for De-biased Visual Question Answering - Zujie Liang et al, SIGIR 2021. [code]
- Passage Retrieval for Outside-Knowledge Visual Question Answering - Chen Qu et al, SIGIR 2021. [code]
- Select, Substitute, Search: A New Benchmark for Knowledge-Augmented Visual Question Answering - Aman Jain et al, SIGIR 2021. [code]
- Visual Question Rewriting for Increasing Response Rate - Jiayi Wei et al, SIGIR 2021.
- Separating Skills and Concepts for Novel Visual Question Answering - Spencer Whitehead et al, CVPR 2021.
- Roses Are Red, Violets Are Blue... but Should VQA Expect Them To? - Corentin Kervadec et al, CVPR 2021. [code]
- Predicting Human Scanpaths in Visual Question Answering - Xianyu Chen et al, CVPR 2021.
- Found a Reason for me? Weakly-supervised Grounded Visual Question Answering using Capsules - Aisha Urooj et al, CVPR 2021.
- TAP: Text-Aware Pre-Training for Text-VQA and Text-Caption - Zhengyuan Yang et al, CVPR 2021.
- Counterfactual VQA: A Cause-Effect Look at Language Bias - Yulei Niu et al, CVPR 2021. [code]
- KRISP: Integrating Implicit and Symbolic Knowledge for Open-Domain Knowledge-Based VQA - Kenneth Marino et al, CVPR 2021.
- Perception Matters: Detecting Perception Failures of VQA Models Using Metamorphic Testing - Yuanyuan Yuan et al, CVPR 2021.
- How Transferable Are Reasoning Patterns in VQA? - Corentin Kervadec et al, CVPR 2021.
- Domain-Robust VQA With Diverse Datasets and Methods but No Target Labels - Mingda Zhang et al, CVPR 2021.
- Learning Better Visual Dialog Agents With Pretrained Visual-Linguistic Representation - Tao Tu et al, CVPR 2021.
- MultiModalQA: complex question answering over text, tables and images - Alon Talmor et al, ICLR 2021. [page]
- CLEVR_HYP: A Dataset and Baselines for Visual Question Answering with Hypothetical Actions over Images - Shailaja Keyur Sampat et al, NAACL-HLT 2021. [code]
- Video Question Answering with Phrases via Semantic Roles - Arka Sadhu et al, NAACL-HLT 2021.
- SOrT-ing VQA Models: Contrastive Gradient Learning for Improved Consistency - Sameer Dharur et al, NAACL-HLT 2021.
- EaSe: A Diagnostic Tool for VQA based on Answer Diversity - Shailza Jolly et al, NAACL-HLT 2021.
- Ensemble of MRR and NDCG models for Visual Dialog - Idan Schwartz, NAACL-HLT 2021. [code]
- Regularizing Attention Networks for Anomaly Detection in Visual Question Answering - Doyup Lee et al, AAAI 2021.
- A Case Study of the Shortcut Effects in Visual Commonsense Reasoning - Keren Ye et al, AAAI 2021. [code]
- VisualMRC: Machine Reading Comprehension on Document Images - Ryota Tanaka et al, AAAI 2021. [page]
- MUTANT: A Training Paradigm for Out-of-Distribution Generalization in Visual Question Answering - Tejas Gokhale et al, EMNLP 2020. [code]
- Learning to Contrast the Counterfactual Samples for Robust Visual Question Answering - Zujie Liang et al, EMNLP 2020. [code]
- VD-BERT: A Unified Vision and Dialog Transformer with BERT - Yue Wang et al, EMNLP 2020.
- Multimodal Graph Networks for Compositional Generalization in Visual Question Answering - Raeid Saqur et al, NeurIPS 2020.
- Removing Bias in Multi-modal Classifiers: Regularization by Maximizing Functional Entropies - Itai Gat et al, NeurIPS 2020.
- Dialog without Dialog Data: Learning Visual Dialog Agents from VQA Data - Michael Cogswell et al, NeurIPS 2020.
- On the Value of Out-of-Distribution Testing: An Example of Goodhart's Law - Damien Teney et al, NeurIPS 2020.
- Reducing Language Biases in Visual Question Answering with Visually-Grounded Question Encoder - Gouthaman KV et al, ECCV 2020.
- Knowledge-Based Video Question Answering with Unsupervised Scene Descriptions - Noa Garcia et al, ECCV 2020.
- Semantic Equivalent Adversarial Data Augmentation for Visual Question Answering - Ruixue Tang et al, ECCV 2020.
- Visual Question Answering on Image Sets - Ankan Bansal et al, ECCV 2020.
- VQA-LOL: Visual Question Answering under the Lens of Logic - Tejas Gokhale et al, ECCV 2020.
- TRRNet: Tiered Relation Reasoning for Compositional Visual Question Answering - Xiaofeng Yang et al, ECCV 2020.
- Spatially Aware Multimodal Transformers for TextVQA - Yash Kant et al, ECCV 2020.
- Multi-Modal Graph Neural Network for Joint Reasoning on Vision and Scene Text - Difei Gao et al, CVPR 2020. [code]
- On the General Value of Evidence, and Bilingual Scene-Text Visual Question Answering - Xinyu Wang et al, CVPR 2020.
- In Defense of Grid Features for Visual Question Answering - Huaizu Jiang et al, CVPR 2020.
- Counterfactual Samples Synthesizing for Robust Visual Question Answering - Long Chen et al, CVPR 2020.
- Counterfactual Vision and Language Learning - Ehsan Abbasnejad et al, CVPR 2020.
- Iterative Answer Prediction With Pointer-Augmented Multimodal Transformers for TextVQA - Ronghang Hu et al, CVPR 2020.
- Towards Causal VQA: Revealing and Reducing Spurious Correlations by Invariant and Covariant Semantic Editing - Vedika Agarwal et al, CVPR 2020.
- SQuINTing at VQA Models: Introspecting VQA Models With Sub-Questions - Ramprasaath R. Selvaraju et al, CVPR 2020.
- TA-Student VQA: Multi-Agents Training by Self-Questioning - Peixi Xiong et al, CVPR 2020.
- VQA With No Questions-Answers Training - Ben-Zion Vatashsky et al, CVPR 2020.
- Hierarchical Conditional Relation Networks for Video Question Answering - Thao Minh Le et al, CVPR 2020.
- Modality Shifting Attention Network for Multi-Modal Video Question Answering - Junyeong Kim et al, CVPR 2020.
- Webly Supervised Knowledge Embedding Model for Visual Reasoning - Wenbo Zheng et al, CVPR 2020.
- Differentiable Adaptive Computation Time for Visual Reasoning - Cristobal Eyzaguirre et al, CVPR 2020.
- A negative case analysis of visual grounding methods for VQA - Robik Shrestha et al, ACL 2020.
- Cross-Modality Relevance for Reasoning on Language and Vision - Chen Zheng et al, ACL 2020.
- Dense-Caption Matching and Frame-Selection Gating for Temporal Localization in VideoQA - Hyounghun Kim et al, ACL 2020.
- TVQA+: Spatio-Temporal Grounding for Video Question Answering - Jie Lei et al, ACL 2020.
- BERT representations for Video Question Answering - Zekun Yang et al, WACV 2020.
- Deep Bayesian Network for Visual Question Generation - Badri Patro et al, WACV 2020.
- Robust Explanations for Visual Question Answering - Badri Patro et al, WACV 2020.
- Visual Question Answering on 360° Images - Shih-Han Chou et al, WACV 2020.
- LEAF-QA: Locate, Encode & Attend for Figure Question Answering - Ritwick Chaudhry et al, WACV 2020.
- Answering Questions about Data Visualizations using Efficient Bimodal Fusion - Kushal Kafle et al, WACV 2020.
- Multi-Question Learning for Visual Question Answering - Chenyi Lei et al, AAAI 2020.
- Explanation vs Attention: A Two-Player Game to Obtain Attention for VQA - Badri N. Patro et al, AAAI 2020.
- Overcoming Language Priors in VQA via Decomposed Linguistic Representations - Chenchen Jing et al, AAAI 2020.
- Unified Vision-Language Pre-Training for Image Captioning and VQA - Luowei Zhou et al, AAAI 2020.
- Re-Attention for Visual Question Answering - Wenya Guo et al, AAAI 2020.
- Divide and Conquer: Question-Guided Spatio-Temporal Contextual Attention for Video Question Answering - Jianwen Jiang et al, AAAI 2020.
- Reasoning with Heterogeneous Graph Alignment for Video Question Answering - Pin Jiang et al, AAAI 2020.
- Location-aware Graph Convolutional Networks for Video Question Answering - Deng Huang et al, AAAI 2020.
- KnowIT VQA: Answering Knowledge-Based Questions about Videos - Noa Garcia et al, AAAI 2020.
- Generating Question Relevant Captions to Aid Visual Question Answering - Jialin Wu et al, ACL 2019.
- Psycholinguistics Meets Continual Learning: Measuring Catastrophic Forgetting in Visual Question Answering - Claudio Greco et al, ACL 2019. [code]
- Multi-grained Attention with Object-level Grounding for Visual Question Answering - Pingping Huang et al, ACL 2019.
- Improving Visual Question Answering by Referring to Generated Paragraph Captions - Hyounghun Kim et al, ACL 2019.
- Compact Trilinear Interaction for Visual Question Answering - Tuong Do et al, ICCV 2019.
- Scene Text Visual Question Answering - Ali Furkan Biten et al, ICCV 2019.
- Multi-Modality Latent Interaction Network for Visual Question Answering - Peng Gao et al, ICCV 2019.
- Relation-Aware Graph Attention Network for Visual Question Answering - Linjie Li et al, ICCV 2019.
- Why Does a Visual Question Have Different Answers? - Nilavra Bhattacharya et al, ICCV 2019.
- RUBi: Reducing Unimodal Biases for Visual Question Answering - Remi Cadene et al, NeurIPS 2019.
- Self-Critical Reasoning for Robust Visual Question Answering - Jialin Wu et al, NeurIPS 2019.
- Deep Modular Co-Attention Networks for Visual Question Answering - Zhou Yu et al, CVPR 2019. [code]
- Information Maximizing Visual Question Generation - Ranjay Krishna et al, CVPR 2019. [code]
- Social-IQ: A Question Answering Benchmark for Artificial Social Intelligence - Amir Zadeh et al, CVPR 2019. [code]
- Learning to Compose Dynamic Tree Structures for Visual Contexts - Kaihua Tang et al, CVPR 2019. [code]
- Transfer Learning via Unsupervised Task Discovery for Visual Question Answering - Hyeonwoo Noh et al, CVPR 2019. [code]
- Video Relationship Reasoning using Gated Spatio-Temporal Energy Graph - Yao-Hung Hubert Tsai et al, CVPR 2019. [code]
- Explainable and Explicit Visual Reasoning over Scene Graphs - Jiaxin Shi et al, CVPR 2019. [code]
- MUREL: Multimodal Relational Reasoning for Visual Question Answering - Remi Cadene et al, CVPR 2019. [code]
- Image-Question-Answer Synergistic Network for Visual Dialog - Dalu Guo et al, CVPR 2019. [code]
- RAVEN: A Dataset for Relational and Analogical Visual rEasoNing - Chi Zhang et al, CVPR 2019. [project page]
- Cycle-Consistency for Robust Visual Question Answering - Meet Shah et al, CVPR 2019.
- It's Not About the Journey; It's About the Destination: Following Soft Paths Under Question-Guidance for Visual Reasoning - Monica Haurilet et al, CVPR 2019.
- OK-VQA: A Visual Question Answering Benchmark Requiring External Knowledge - Kenneth Marino et al, CVPR 2019.
- Visual Question Answering as Reading Comprehension - Hui Li et al, CVPR 2019.
- Dynamic Fusion With Intra- and Inter-Modality Attention Flow for Visual Question Answering - Peng Gao et al, CVPR 2019.
- Explicit Bias Discovery in Visual Question Answering Models - Varun Manjunatha et al, CVPR 2019.
- Answer Them All! Toward Universal Visual Question Answering Models - Robik Shrestha et al, CVPR 2019.
- Visual Query Answering by Entity-Attribute Graph Matching and Reasoning - Peixi Xiong et al, CVPR 2019.
- Differential Networks for Visual Question Answering - Chenfei Wu et al, AAAI 2019. [code]
- BLOCK: Bilinear Superdiagonal Fusion for Visual Question Answering and Visual Relationship Detection - Hedi Ben-younes et al, AAAI 2019. [code]
- Dynamic Capsule Attention for Visual Question Answering - Yiyi Zhou et al, AAAI 2019. [code]
- Structured Two-stream Attention Network for Video Question Answering - Lianli Gao et al, AAAI 2019. [code]
- Beyond RNNs: Positional Self-Attention with Co-Attention for Video Question Answering - Xiangpeng Li et al, AAAI 2019. [code]
- WK-VQA: World Knowledge-enabled Visual Question Answering - Sanket Shah et al, AAAI 2019. [code]
- Free VQA Models from Knowledge Inertia by Pairwise Inconformity Learning - Yiyi Zhou et al, AAAI 2019. [code]
- Focal Visual-Text Attention for Memex Question Answering - Junwei Liang et al, TPAMI 2019. [code]
- Plenty is Plague: Fine-Grained Learning for Visual Question Answering - Yiyi Zhou et al, TPAMI 2019.
- Combining Multiple Cues for Visual Madlibs Question Answering - Tatiana Tommasi et al, IJCV 2019. [code]
- Large-Scale Answerer in Questioner's Mind for Visual Dialog Question Generation - Sang-Woo Lee et al, ICLR 2019. [code]
- Bilinear Attention Networks - Jin-Hwa Kim et al, NIPS 2018. [code]
- Chain of Reasoning for Visual Question Answering - Chenfei Wu et al, NIPS 2018. [code]
- Learning Conditioned Graph Structures for Interpretable Visual Question Answering - Will Norcliffe-Brown et al, NIPS 2018. [code]
- Learning to Specialize with Knowledge Distillation for Visual Question Answering - Jonghwan Mun et al, NIPS 2018. [code]
- Out of the Box: Reasoning with Graph Convolution Nets for Factual Visual Question Answering - Medhini Narasimhan et al, NIPS 2018. [code]
- Overcoming Language Priors in Visual Question Answering with Adversarial Regularization - Sainandan Ramakrishnan et al, NIPS 2018. [code]
- Explicit Reasoning over End-to-End Neural Architectures for Visual Question Answering - Somak Aditya et al, AAAI 2018. [code]
- Co-Attending Free-Form Regions and Detections with Multi-Modal Multiplicative Feature Embedding for Visual Question Answering - Pan Lu et al, AAAI 2018. [code]
- Exploring Human-Like Attention Supervision in Visual Question Answering - Somak Aditya et al, AAAI 2018. [code]
- Movie Question Answering: Remembering the Textual Cues for Layered Visual Contents - Bo Wang et al, AAAI 2018. [code]
- Feature Enhancement in Attention for Visual Question Answering - Yuetan Lin et al, IJCAI 2018. [code]
- A Question Type Driven Framework to Diversify Visual Question Generation - Zhihao Fan et al, IJCAI 2018. [code]
- Multi-Turn Video Question Answering via Multi-Stream Hierarchical Attention Context Network - Zhou Zhao et al, IJCAI 2018. [code]
- Open-Ended Long-form Video Question Answering via Adaptive Hierarchical Reinforced Networks - Zhou Zhao et al, IJCAI 2018. [code]
- Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering - Peter Anderson et al, CVPR 2018. [code(author)] [code(pythiaV0.1)] [code(Pytorch Reimplementation)]
- Tips and Tricks for Visual Question Answering: Learnings From the 2017 Challenge - Damien Teney et al, CVPR 2018. [code]
- Learning by Asking Questions - Ishan Misra et al, CVPR 2018. [code]
- Embodied Question Answering - Abhishek Das et al, CVPR 2018. [code]
- VizWiz Grand Challenge: Answering Visual Questions From Blind People - Danna Gurari et al, CVPR 2018. [code]
- Textbook Question Answering Under Instructor Guidance With Memory Networks - Juzheng Li et al, CVPR 2018. [code]
- IQA: Visual Question Answering in Interactive Environments - Daniel Gordon et al, CVPR 2018. [sample video]
- Don't Just Assume; Look and Answer: Overcoming Priors for Visual Question Answering - Aishwarya Agrawal et al, CVPR 2018. [code]
- Learning Answer Embeddings for Visual Question Answering - Hexiang Hu et al, CVPR 2018. [code]
- DVQA: Understanding Data Visualizations via Question Answering - Kushal Kafle et al, CVPR 2018. [code]
- Cross-Dataset Adaptation for Visual Question Answering - Wei-Lun Chao et al, CVPR 2018. [code]
- Two Can Play This Game: Visual Dialog With Discriminative Question Generation and Answering - Unnat Jain et al, CVPR 2018. [code]
- Improved Fusion of Visual and Language Representations by Dense Symmetric Co-Attention for Visual Question Answering - Duy-Kien Nguyen et al, CVPR 2018. [code]
- Visual Question Generation as Dual Task of Visual Question Answering - Yikang Li et al, CVPR 2018. [code]
- Focal Visual-Text Attention for Visual Question Answering - Junwei Liang et al, CVPR 2018. [code]
- Motion-Appearance Co-Memory Networks for Video Question Answering - Jiyang Gao et al, CVPR 2018. [code]
- Visual Question Answering With Memory-Augmented Networks - Chao Ma et al, CVPR 2018. [code]
- Visual Question Reasoning on General Dependency Tree - Qingxing Cao et al, CVPR 2018. [code]
- Differential Attention for Visual Question Answering - Badri Patro et al, CVPR 2018. [code]
- Learning Visual Knowledge Memory Networks for Visual Question Answering - Zhou Su et al, CVPR 2018. [code]
- IVQA: Inverse Visual Question Answering - Feng Liu et al, CVPR 2018. [code]
- Customized Image Narrative Generation via Interactive Visual Question Generation and Answering - Andrew Shin et al, CVPR 2018. [code]
- Object-Difference Attention: A simple relational attention for Visual Question Answering - Chenfei Wu et al, ACM MM 2018. [code]
- Enhancing Visual Question Answering Using Dropout - Zhiwei Fang et al, ACM MM 2018. [code]
- Fast Parameter Adaptation for Few-shot Image Captioning and Visual Question Answering - Xuanyi Dong et al, ACM MM 2018. [code]
- Explore Multi-Step Reasoning in Video Question Answering - Xiaomeng Song et al, ACM MM 2018. [code] [SVQA dataset]
- Visual Question Answering as a Meta Learning Task - Damien Teney et al, ECCV 2018. [code]
- Question-Guided Hybrid Convolution for Visual Question Answering - Peng Gao et al, ECCV 2018. [code]
- Goal-Oriented Visual Question Generation via Intermediate Rewards - Junjie Zhang et al, ECCV 2018. [code]
- Multimodal Dual Attention Memory for Video Story Question Answering - Kyung-Min Kim et al, ECCV 2018. [code]
- A Joint Sequence Fusion Model for Video Question Answering and Retrieval - Youngjae Yu et al, ECCV 2018. [code]
- Deep Attention Neural Tensor Network for Visual Question Answering - Yalong Bai et al, ECCV 2018. [code]
- Question Type Guided Attention in Visual Question Answering - Yang Shi et al, ECCV 2018. [code]
- Learning Visual Question Answering by Bootstrapping Hard Attention - Mateusz Malinowski et al, ECCV 2018. [code]
- Straight to the Facts: Learning Knowledge Base Retrieval for Factual Visual Question Answering - Medhini Narasimhan et al, ECCV 2018. [code]
- Visual Question Generation for Class Acquisition of Unknown Objects - Kohei Uehara et al, ECCV 2018. [code]
- Image Captioning and Visual Question Answering Based on Attributes and External Knowledge - Qi Wu et al, TPAMI 2018. [code]
- FVQA: Fact-Based Visual Question Answering - Peng Wang et al, TPAMI 2018. [code]
- R-VQA: Learning Visual Relation Facts with Semantic Attention for Visual Question Answering - Pan Lu et al, SIGKDD 2018. [code(Dataset)]
- Interpretable Counting for Visual Question Answering - Alexander Trott et al, ICLR 2018. [code]
- Learning to Count Objects in Natural Images for Visual Question Answering - Yan Zhang et al, ICLR 2018. [code]
- A Better Way to Attend: Attention With Trees for Video Question Answering - Hongyang Xue et al, TIP 2018. [code]
- Zero-Shot Transfer VQA Dataset - Pan Lu et al, arxiv preprint. [code]
- Visual Question Answering using Explicit Visual Attention - Vasileios Lioutas et al, ISCAS 2018. [code]
- Explicit ensemble attention learning for improving visual question answering - Vasileios Lioutas et al, Pattern Recognition Letters 2018. [code]
Please check awesome-vqa by JamesChuanggg for other VQA papers from 2015 to 2017; it seems that project has not been maintained for a long time. I really appreciate his work and will merge it into this list in the future. Stay tuned...
- Learning to Reason: End-to-End Module Networks for Visual Question Answering - Ronghang Hu et al, ICCV 2017. [code]
- Structured Attentions for Visual Question Answering - Chen Zhu et al, ICCV 2017. [code]
- VQS: Linking Segmentations to Questions and Answers for Supervised Attention in VQA and Question-Focused Semantic Segmentation - Chuang Gan et al, ICCV 2017. [code]
- Multi-modal Factorized Bilinear Pooling with Co-attention Learning for Visual Question Answering - Zhou Yu et al, ICCV 2017. [code]
- An Analysis of Visual Question Answering Algorithms - Kushal Kafle et al, ICCV 2017. [code]
- MUTAN: Multimodal Tucker Fusion for Visual Question Answering - Hedi Ben-younes et al, ICCV 2017. [code]
- MarioQA: Answering Questions by Watching Gameplay Videos - Jonghwan Mun et al, ICCV 2017. [code]
- Learning to Disambiguate by Asking Discriminative Questions - Yining Li et al, ICCV 2017. [code]
I will collect the leaderboard implementations in the future. Stay tuned...
To the extent possible under law, Jokie Leung has waived all copyright and related or neighboring rights to this work.
I really appreciate their contributions to this area.