Paper | Code |
---|---|
Transferring Domain-Agnostic Knowledge in Video Question Answering - Tianran Wu et al, BMVC 2021. | No Code |
Paper | Code |
---|---|
Progressive Graph Attention Network for Video Question Answering - Liang Peng et al, ACM MM 2021. | No Code |
Pairwise VLAD Interaction Network for Video Question Answering - Hui Wang et al, ACM MM 2021. | No Code |
Paper | Code |
---|---|
SUTD-TrafficQA: A Question Answering Benchmark and an Efficient Network for Video Reasoning Over Traffic Events - Li Xu et al, CVPR 2021. | [code] |
Bridge To Answer: Structure-Aware Graph Interaction Network for Video Question Answering - Jungin Park et al, CVPR 2021. | No Code |
NExT-QA: Next Phase of Question-Answering to Explaining Temporal Actions - Junbin Xiao et al, CVPR 2021. | [code] |
Paper | Code |
---|---|
Just Ask: Learning To Answer Questions From Millions of Narrated Videos - Antoine Yang et al, ICCV 2021. | [code] |
Env-QA: A Video Question Answering Benchmark for Comprehensive Understanding of Dynamic Environments - Difei Gao et al, ICCV 2021. | No Code |
On The Hidden Treasure of Dialog in Video Question Answering - Deniz Engin et al, ICCV 2021. | [code] |
Pano-AVQA: Grounded Audio-Visual Question Answering on 360° Videos - Heeseung Yun et al, ICCV 2021. | [code] |
HAIR: Hierarchical Visual-Semantic Relational Reasoning for Video Question Answering - Fei Liu et al, ICCV 2021. | No Code |
Video Question Answering Using Language-Guided Deep Compressed-Domain Video Feature - Nayoung Kim et al, ICCV 2021. | No Code |
Paper | Code |
---|---|
Learning from Inside: Self-driven Siamese Sampling and Reasoning for Video Question Answering - Weijiang Yu et al, NeurIPS 2021. | No Code |
Paper | Code |
---|---|
Video Question Answering with Phrases via Semantic Roles - Arka Sadhu et al, NAACL-HLT 2021. | No Code |
Paper | Code |
---|---|
VideoCLIP: Contrastive Pre-training for Zero-shot Video-Text Understanding - Hu Xu et al, EMNLP 2021. | [code] |
Paper | Code |
---|---|
Attend What You Need: Motion-Appearance Synergistic Networks for Video Question Answering - Ahjeong Seo et al, ACL 2021. | [code] |
Multi-Scale Progressive Attention Network for Video Question Answering - Zhicheng Guo et al, ACL 2021. | No code |
Paper | Code |
---|---|
Self-supervised Pre-training and Contrastive Representation Learning for Multiple-choice Video QA - Seonhoon Kim et al, AAAI 2021. | No code |
Paper | Code |
---|---|
Two-Stream Spatiotemporal Compositional Attention Network for VideoQA - Taiki Miyanishi et al, BMVC 2020. | No code |
On Modality Bias in the TVQA Dataset - Thomas Winterbottom et al, BMVC 2020. | No code |
Paper | Code |
---|---|
Dual Hierarchical Temporal Convolutional Network with QA-Aware Dynamic Normalization for Video Story Question Answering - Fei Liu et al, ACM MM 2020. | No code |
Paper | Code |
---|---|
Knowledge-Based Video Question Answering with Unsupervised Scene Descriptions - Noa Garcia et al, ECCV 2020. | [code] |
Paper | Code |
---|---|
Hierarchical Conditional Relation Networks for Video Question Answering - Thao Minh Le et al, CVPR 2020. | [code] |
Modality Shifting Attention Network for Multi-Modal Video Question Answering - Junyeong Kim et al, CVPR 2020. | No code |
Paper | Code |
---|---|
Dense-Caption Matching and Frame-Selection Gating for Temporal Localization in VideoQA - Hyounghun Kim et al, ACL 2020. | [code] |
TVQA+: Spatio-Temporal Grounding for Video Question Answering - Jie Lei et al, ACL 2020. | [code] |
Paper | Code |
---|---|
BERT representations for Video Question Answering - Zekun Yang et al, WACV 2020. | [code] |
Paper | Code |
---|---|
Divide and Conquer: Question‐Guided Spatio‐Temporal Contextual Attention for Video Question Answering - Jianwen Jiang et al, AAAI 2020. | No code |
Reasoning with Heterogeneous Graph Alignment for Video Question Answering - Pin Jiang et al, AAAI 2020. | No code |
Location‐aware Graph Convolutional Networks for Video Question Answering - Deng Huang et al, AAAI 2020. | No code |
KnowIT VQA: Answering Knowledge‐Based Questions about Videos - Noa Garcia et al, AAAI 2020. | No code |
Paper | Code |
---|---|
Spatio-temporal Relational Reasoning for Video Question Answering - Gursimran Singh et al, BMVC 2019. | [code] |
Paper | Code |
---|---|
Multi-interaction Network with Object Relation for Video Question Answering - Weike Jin et al, ACM MM 2019. | No code |
Question-Aware Tube-Switch Network for Video Question Answering - Tianhao Yang et al, ACM MM 2019. | No code |
Learnable Aggregating Net with Diversity Learning for Video Question Answering - Xiangpeng Li et al, ACM MM 2019. | No code |
Paper | Code |
---|---|
Video Relationship Reasoning using Gated Spatio-Temporal Energy Graph - Yao-Hung Hubert Tsai et al, CVPR 2019. | [code] |
Heterogeneous Memory Enhanced Multimodal Attention Model for Video Question Answering - Chenyou Fan et al, CVPR 2019. | [code] |
Progressive Attention Memory Network for Movie Story Question Answering - Junyeong Kim et al, CVPR 2019. | No code |
Paper | Code |
---|---|
Structured Two-stream Attention Network for Video Question Answering - Lianli Gao et al, AAAI 2019. | No code |
Beyond RNNs: Positional Self-Attention with Co-Attention for Video Question Answering - Xiangpeng Li et al, AAAI 2019. | [code] |
Paper | Code |
---|---|
Explore Multi-Step Reasoning in Video Question Answering - Xiaomeng Song et al, ACM MM 2018. | [code] [SVQA dataset] |
Paper | Code |
---|---|
Multimodal Dual Attention Memory for Video Story Question Answering - Kyung-Min Kim et al, ECCV 2018. | No code |
A Joint Sequence Fusion Model for Video Question Answering and Retrieval - Youngjae Yu et al, ECCV 2018. | [code] |
Paper | Code |
---|---|
Multi-Turn Video Question Answering via Multi-Stream Hierarchical Attention Context Network - Zhou Zhao et al, IJCAI 2018. | No code |
Open-Ended Long-form Video Question Answering via Adaptive Hierarchical Reinforced Networks - Zhou Zhao et al, IJCAI 2018. | No code |
Paper | Code |
---|---|
Motion-Appearance Co-Memory Networks for Video Question Answering - Jiyang Gao et al, CVPR 2018. | No code |
Paper | Code |
---|---|
Movie Question Answering: Remembering the Textual Cues for Layered Visual Contents - Bo Wang et al, AAAI 2018. | No code |
Paper | Code |
---|---|
A Better Way to Attend: Attention With Trees for Video Question Answering - Hongyang Xue et al, TIP 2018. | [code] |
Paper | Code |
---|---|
Kuo-Hao Zeng, Tseng-Hung Chen, Ching-Yao Chuang, Yuan-Hong Liao, Juan Carlos Niebles, Min Sun, Leveraging Video Descriptions to Learn Video Question Answering, AAAI 2017. | No code |
Makarand Tapaswi, Yukun Zhu, Rainer Stiefelhagen, Antonio Torralba, Raquel Urtasun, Sanja Fidler, MovieQA: Understanding Stories in Movies Through Question-Answering, CVPR 2016. | [code] |
Linchao Zhu, Zhongwen Xu, Yi Yang, Alexander G. Hauptmann, [Uncovering Temporal Context for Video Question and Answering](https://arxiv.org/pdf/1511.04670v1.pdf), arXiv:1511.04670v1, Nov 2015. | No code |
Paper | Code |
---|---|
DVD: A Diagnostic Dataset for Multi-step Reasoning in Video Grounded Dialogue - Hung Le et al, ACL 2021 | [code] |
Structured Co-reference Graph Attention for Video-grounded Dialogue - Junyeong Kim et al, AAAI 2021 | No code |
Multimodal Transformer Networks for End-to-End Video-Grounded Dialogue Systems - Hung Le et al, ACL 2019. | [code] |
MarioQA: Answering Questions by Watching Gameplay Videos - Jonghwan Mun et al, ICCV 2017. | No code |
Video Background Music Generation with Controllable Music Transformer - Shangzhe Di et al, ACM MM 2021
Notes:
arXiv preprints are marked with "#".
Other modalities are marked as "Acou", "Sub", or "Lan".
YT-T: YouTube-Temporal-180M [Zellers et al., 2021]
CC: Conceptual Captions-3M [Sharma et al., 2018]
Web: WebVid-2.5M [Bain et al., 2021]
Method | Insights & Methods | Video Encoder | Text (Question) Encoder | CM-PT | MSVD | MSRVTT | OD#1 | OD#2 | OD#3 |
---|---|---|---|---|---|---|---|---|---|
HME [Fan et al., 2019] | Mem | RN, VGG, C3D | GV | \ | 33.7 | 33.0 | | | |
DualVGR [Wang et al., 2021] | GNN | RN, RX(3D) | GV | \ | 39.0 | 35.5 | | | |
PGAT [Peng et al., 2021] | GNN, MG, HL | RN, RX(3D), RoI | GV | \ | 39.0 | 38.1 | | | |
HQGA [Xiao et al., 2022] | MN, GNN, HL, MG | RN, RX(3D), RoI | BT | \ | 41.2 | 38.6 | NExT-QA: 51.8 | | |
MERLOT [Zellers et al., 2021] | TF | ViT (E2E) | BT | YT-T&CC | \ | 43.1 | | | |
Just Ask (VQA-T) [Yang et al., 2021] | TF | S3D ? | BT | H2VQA69M | 46.3 | 41.5 | | | |
ALPRO [Li et al., 2022] | TF | | | | 45.9 (46.3 best) | 42.1 | | | |
\# VIOLET [Fu et al., 2021] | TF | VSwin (E2E) | BT | Web&YT-T&CC | [47.9] | [43.9] | | | |
\# Singularity-temporal [Lei et al., 2022] | TF | | | | | [43.9] | | | |
\# All in One [Wang et al., 2022] | TF | | | | 47.9 (1.6↑) | 44.3 (1.2↑) | | | |
Their Backbone + IGV | | | | | 40.8 | 38.3 | | | |