Video Question Answering Review (Video QA)

2021

BMVC

Paper Code
Transferring Domain-Agnostic Knowledge in Video Question Answering - Tianran Wu et al, BMVC 2021. No Code

ACM MM

Paper Code
Progressive Graph Attention Network for Video Question Answering - Liang Peng et al, ACM MM 2021. No Code
Pairwise VLAD Interaction Network for Video Question Answering - Hui Wang et al, ACM MM 2021. No Code

CVPR

Paper Code
SUTD-TrafficQA: A Question Answering Benchmark and an Efficient Network for Video Reasoning Over Traffic Events - Li Xu et al, CVPR 2021. [code]
Bridge To Answer: Structure-Aware Graph Interaction Network for Video Question Answering - Jungin Park et al, CVPR 2021. No Code
NExT-QA: Next Phase of Question-Answering to Explaining Temporal Actions - Junbin Xiao et al, CVPR 2021. [code]

ICCV

Paper Code
Just Ask: Learning To Answer Questions From Millions of Narrated Videos - Antoine Yang et al, ICCV 2021. [code]
Env-QA: A Video Question Answering Benchmark for Comprehensive Understanding of Dynamic Environments - Difei Gao et al, ICCV 2021. No Code
On The Hidden Treasure of Dialog in Video Question Answering - Deniz Engin et al, ICCV 2021. [code]
Pano-AVQA: Grounded Audio-Visual Question Answering on 360° Videos - Heeseung Yun et al, ICCV 2021. [code]
HAIR: Hierarchical Visual-Semantic Relational Reasoning for Video Question Answering - Fei Liu et al, ICCV 2021. No Code
Video Question Answering Using Language-Guided Deep Compressed-Domain Video Feature - Nayoung Kim et al, ICCV 2021. No Code

NeurIPS

Paper Code
Learning from Inside: Self-driven Siamese Sampling and Reasoning for Video Question Answering - Weijiang Yu et al, NeurIPS 2021. No Code

NAACL-HLT

Paper Code
Video Question Answering with Phrases via Semantic Roles - Arka Sadhu et al, NAACL-HLT 2021. No Code

EMNLP

Paper Code
VideoCLIP: Contrastive Pre-training for Zero-shot Video-Text Understanding - Hu Xu et al, EMNLP 2021. [code]

ACL

Paper Code
Attend What You Need: Motion-Appearance Synergistic Networks for Video Question Answering - Ahjeong Seo et al, ACL 2021. [code]
Multi-Scale Progressive Attention Network for Video Question Answering - Zhicheng Guo et al, ACL 2021. No code

AAAI

Paper Code
Self-supervised Pre-training and Contrastive Representation Learning for Multiple-choice Video QA - Seonhoon Kim et al, AAAI 2021. No code

2020

BMVC

Paper Code
Two-Stream Spatiotemporal Compositional Attention Network for VideoQA - Taiki Miyanishi et al, BMVC 2020. No code
On Modality Bias in the TVQA Dataset - Thomas Winterbottom et al, BMVC 2020. No code

ACM MM

Paper Code
Dual Hierarchical Temporal Convolutional Network with QA-Aware Dynamic Normalization for Video Story Question Answering - Fei Liu et al, ACM MM 2020. No code

ECCV

Paper Code
Knowledge-Based Video Question Answering with Unsupervised Scene Descriptions - Noa Garcia et al, ECCV 2020. [code]

CVPR

Paper Code
Hierarchical Conditional Relation Networks for Video Question Answering - Thao Minh Le et al, CVPR 2020. [code]
Modality Shifting Attention Network for Multi-Modal Video Question Answering - Junyeong Kim et al, CVPR 2020. No code

ACL

Paper Code
Dense-Caption Matching and Frame-Selection Gating for Temporal Localization in VideoQA - Hyounghun Kim et al, ACL 2020. [code]
TVQA+: Spatio-Temporal Grounding for Video Question Answering - Jie Lei et al, ACL 2020. [code]

WACV

Paper Code
BERT representations for Video Question Answering - Zekun Yang et al, WACV 2020. [code]

AAAI

Paper Code
Divide and Conquer: Question-Guided Spatio-Temporal Contextual Attention for Video Question Answering - Jianwen Jiang et al, AAAI 2020. No code
Reasoning with Heterogeneous Graph Alignment for Video Question Answering - Pin Jiang et al, AAAI 2020. No code
Location-aware Graph Convolutional Networks for Video Question Answering - Deng Huang et al, AAAI 2020. No code
KnowIT VQA: Answering Knowledge-Based Questions about Videos - Noa Garcia et al, AAAI 2020. No code

2019

BMVC

Paper Code
Spatio-temporal Relational Reasoning for Video Question Answering - Gursimran Singh et al, BMVC 2019. [code]

ACM MM

Paper Code
Multi-interaction Network with Object Relation for Video Question Answering - Weike Jin et al, ACM MM 2019. No code
Question-Aware Tube-Switch Network for Video Question Answering - Tianhao Yang et al, ACM MM 2019. No code
Learnable Aggregating Net with Diversity Learning for Video Question Answering - Xiangpeng Li et al, ACM MM 2019. No code

CVPR

Paper Code
Video Relationship Reasoning using Gated Spatio-Temporal Energy Graph - Yao-Hung Hubert Tsai et al, CVPR 2019. [code]
Heterogeneous Memory Enhanced Multimodal Attention Model for Video Question Answering - Chenyou Fan et al, CVPR 2019. [code]
Progressive Attention Memory Network for Movie Story Question Answering - Junyeong Kim et al, CVPR 2019. No code

AAAI

Paper Code
Structured Two-stream Attention Network for Video Question Answering - Lianli Gao et al, AAAI 2019. No code
Beyond RNNs: Positional Self-Attention with Co-Attention for Video Question Answering - Xiangpeng Li et al, AAAI 2019. [code]

2018

ACM MM

Paper Code
Explore Multi-Step Reasoning in Video Question Answering - Xiaomeng Song et al, ACM MM 2018. [code] [SVQA dataset]

ECCV

Paper Code
Multimodal Dual Attention Memory for Video Story Question Answering - Kyung-Min Kim et al, ECCV 2018. No code
A Joint Sequence Fusion Model for Video Question Answering and Retrieval - Youngjae Yu et al, ECCV 2018. [code]

IJCAI

Paper Code
Multi-Turn Video Question Answering via Multi-Stream Hierarchical Attention Context Network - Zhou Zhao et al, IJCAI 2018. No code
Open-Ended Long-form Video Question Answering via Adaptive Hierarchical Reinforced Networks - Zhou Zhao et al, IJCAI 2018. No code

CVPR

Paper Code
Motion-Appearance Co-Memory Networks for Video Question Answering - Jiyang Gao et al, CVPR 2018. No code

AAAI

Paper Code
Movie Question Answering: Remembering the Textual Cues for Layered Visual Contents - Bo Wang et al, AAAI 2018. No code

TIP

Paper Code
A Better Way to Attend: Attention With Trees for Video Question Answering - Hongyang Xue et al, TIP 2018. [code]

Before 2018

Paper Code
Kuo-Hao Zeng, Tseng-Hung Chen, Ching-Yao Chuang, Yuan-Hong Liao, Juan Carlos Niebles, Min Sun, Leveraging Video Descriptions to Learn Video Question Answering, AAAI 2017. No code
Makarand Tapaswi, Yukun Zhu, Rainer Stiefelhagen, Antonio Torralba, Raquel Urtasun, Sanja Fidler, MovieQA: Understanding Stories in Movies Through Question-Answering, CVPR 2016. [code]
Linchao Zhu, Zhongwen Xu, Yi Yang, Alexander G. Hauptmann, [Uncovering Temporal Context for Video Question and Answering](https://arxiv.org/pdf/1511.04670v1.pdf), arXiv:1511.05676v1, Nov 2015. No code

Related Tasks

multi-turn video question answering (video-grounded dialogue)

Paper Code
DVD: A Diagnostic Dataset for Multi-step Reasoning in Video Grounded Dialogue - Hung Le et al, ACL 2021. [code]
Structured Co-reference Graph Attention for Video-grounded Dialogue - Junyeong Kim et al, AAAI 2021. No code
Multimodal Transformer Networks for End-to-End Video-Grounded Dialogue Systems - Hung Le et al, ACL 2019. [code]

....

MarioQA: Answering Questions by Watching Gameplay Videos - Jonghwan Mun et al, ICCV 2017. No code

Video Background Music Generation with Controllable Music Transformer - Shangzhe Di et al, ACM MM 2021

SOTA till 2022

Notes:

(arXiv-only papers are marked with "#")

(Other modalities are marked as "Acou", "Sub", or "Lan")

YT-T: Youtube-Temporal-180M [Zellers et al., 2021]

CC: Conceptual Captions-3M [Sharma et al., 2018]

Web: WebVid2.5M [Bain et al., 2021]

| Method | Title | Insights & Methods | Video Encoder | Text (Question) Encoder | CM-PT | MSVD | MSRVTT | OD#1 | OD#2 | OD#3 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| HME [Fan et al., 2019] | | Mem | RN, VGG, C3D | GV | \ | 33.7 | 33.0 | | | |
| DualVGR [Wang et al., 2021] | | GNN | RN, RX(3D) | GV | \ | 39.0 | 35.5 | | | |
| PGAT [Peng et al., 2021] | | GNN, MG, HL | RN, RX(3D), RoI | GV | \ | 39.0 | 38.1 | | | |
| HQGA [Xiao et al., 2022] | | MN, GNN, HL, MG | RN, RX(3D), RoI | BT | \ | 41.2 | 38.6 | NExT-QA: 51.8 | | |
| MERLOT [Zellers et al., 2021] | | TF | ViT (E2E) | BT | YT-T&CC | \ | 43.1 | | | |
| Just Ask (VQA-T) [Yang et al., 2021] | | TF | S3D ? | BT | H2VQA69M | 46.3 | 41.5 | | | |
| ALPRO [Li et al., 2022] | | TF | | | | 45.9 (46.3 best) | 42.1 | | | |
| # VIOLET [Fu et al., 2021] | | TF | VSwin (E2E) | BT | Web&YT-T&CC | [47.9] | [43.9] | | | |
| # Singularity-temporal [Lei et al., 2022] | | TF | | | | | [43.9] | | | |
| # All in One [Wang et al., 2022] | | TF | | | | 47.9 (1.6↑) | 44.3 (1.2↑) | | | |
| Their Backbone +IGV | | | | | | 40.8 | 38.3 | | | |