Task:
- Given a query, find the corresponding moment in a given video (the major focus of this repo).
Markdown format:
- [Paper Name](link) - Author 1 et al, `Conference Year`. [[code]](link)
- 2020/07/27: started the repo.
- Papers before 2020 are mainly collected by muketong.
- to be updated ...
- grounding, retrieval, localization
- None.
- Grounded Language Learning from Video Described with Sentences - H. Yu et al, `ACL 2013`.
- Visual Semantic Search: Retrieving Videos via Complex Textual Queries - Dahua Lin et al, `CVPR 2014`.
- Jointly Modeling Deep Video and Compositional Text to Bridge Vision and Language in a Unified Framework - R. Xu et al, `AAAI 2015`.
- Unsupervised Alignment of Actions in Video with Text Descriptions - Y. C. Song et al, `IJCAI 2016`.
- Localizing Moments in Video with Natural Language - Lisa Anne Hendricks et al, `ICCV 2017`. [code]
- TALL: Temporal Activity Localization via Language Query - Jiyang Gao et al, `ICCV 2017`. [code]
- !(Still on arxiv 20200609) Where to Play: Retrieval of Video Segments using Natural-Language Queries - S. Lee et al, `arxiv 2017`.
- Attentive Moment Retrieval in Videos - M. Liu et al, `SIGIR 2018`.
- Temporal Modular Networks for Retrieving Complex Compositional Activities in Videos - B. Liu et al, `ECCV 2018`.
- (Video Retrieval+Grounding) Find and Focus: Retrieve and Localize Video Events with Natural Language Queries - Dian Shao et al, `ECCV 2018`.
- Temporally Grounding Natural Sentence in Video - J. Chen et al, `EMNLP 2018`.
- Localizing Moments in Video with Temporal Language - Lisa Anne Hendricks et al, `EMNLP 2018`.
- Cross-modal Moment Localization in Videos - Meng Liu et al, `MM 2018`.
Supervised:
- MAC: Mining Activity Concepts for Language-based Temporal Localization - Runzhou Ge et al, `WACV 2019`. [code]
- Multilevel Language and Vision Integration for Text-to-Clip Retrieval - H. Xu et al, `AAAI 2019`. [code]
- Read, Watch, and Move: Reinforcement Learning for Temporally Grounding Natural Language Descriptions in Videos - Dongliang He et al, `AAAI 2019`.
- To Find Where You Talk: Temporal Sentence Localization in Video with Attention Based Location Regression - Y. Yuan et al, `AAAI 2019`. [code]
- Semantic Proposal for Activity Localization in Videos via Sentence Query - S. Chen et al, `AAAI 2019`.
- Localizing Natural Language in Videos - J. Chen et al, `AAAI 2019`.
- ExCL: Extractive Clip Localization Using Natural Language Descriptions - S. Ghosh et al, `NAACL 2019`.
- Cross-Modal Video Moment Retrieval with Spatial and Language-Temporal Attention - B. Jiang et al, `ICMR 2019`. [code]
- Language-Driven Temporal Activity Localization: A Semantic Matching Reinforcement Learning Model - W. Wang et al, `CVPR 2019`.
- MAN: Moment Alignment Network for Natural Language Moment Retrieval via Iterative Graph Adjustment - Da Zhang et al, `CVPR 2019`.
- Cross-Modal Interaction Networks for Query-Based Moment Retrieval in Videos - Zhu Zhang et al, `SIGIR 2019`. [code]
- Semantic Conditioned Dynamic Modulation for Temporal Sentence Grounding in Videos - Yitian Yuan et al, `NeurIPS 2019`. [code]
- DEBUG: A Dense Bottom-Up Grounding Approach for Natural Language Video Localization - Chujie Lu et al, `EMNLP 2019`.
- !(Still on arxiv 20200609) Temporal Localization of Moments in Video Collections with Natural Language - V. Escorcia et al, `arxiv 2019`.
Weakly Supervised:
- Weakly Supervised Video Moment Retrieval From Text Queries - N. C. Mithun et al, `CVPR 2019`.
- Weakly-Supervised Spatio-Temporally Grounding Natural Sentence in Video - Zhenfang Chen et al, `ACL 2019`. [code]
- WSLLN: Weakly Supervised Natural Language Localization Networks - M. Gao et al, `EMNLP 2019`.
Supervised:
- Moment Retrieval via Cross-Modal Interaction Networks With Query Reconstruction - Zhijie Lin et al, `TIP 2020`.
- Rethinking the Bottom-Up Framework for Query-based Video Localization - Long Chen et al, `AAAI 2020`.
- Temporally Grounding Language Queries in Videos by Contextual Boundary-aware Prediction - Jingwen Wang et al, `AAAI 2020`. [code]
- Learning 2D Temporal Adjacent Networks for Moment Localization with Natural Language - Songyang Zhang et al, `AAAI 2020`. [code]
- Tree-Structured Policy based Progressive Reinforcement Learning for Temporally Language Grounding in Video - Jie Wu et al, `AAAI 2020`. [code]
- Proposal-free Temporal Moment Localization of a Natural-Language Query in Video using Guided Attention - C. R. Opazo et al, `WACV 2020`. [code]
- Local-Global Video-Text Interactions for Temporal Grounding - Jonghwan Mun et al, `CVPR 2020`. [code]
- Dense Regression Network for Video Grounding - Runhao Zeng et al, `CVPR 2020`. [code]
- Tripping through Time: Efficient Localization of Activities in Videos - Meera Hahn et al, `BMVC 2020`.
- Span-based Localizing Network for Natural Language Video Localization - Hao Zhang et al, `ACL 2020`. [code]
- Hierarchical Visual-Textual Graph for Temporal Activity Localization via Language - Shaoxiang Chen et al, `ECCV 2020`. [code]
- Learning Modality Interaction for Temporal Sentence Localization and Event Captioning in Videos - Shaoxiang Chen et al, `ECCV 2020`.
- Jointly Cross- and Self-Modal Graph Attention Network for Query-Based Moment Localization - Daizong Liu et al, `MM 2020`. [code]
- Fine-grained Iterative Attention Network for Temporal Language Localization in Videos - Xiaoye Qu et al, `MM 2020`.
- Dual Path Interaction Network for Video Moment Localization - Hao Wang et al, `MM 2020`.
- Adversarial Video Moment Retrieval by Jointly Modeling Ranking and Localization - et al, `MM 2020`. [code]
- STRONG: Spatio-Temporal Reinforcement Learning for Cross-Modal Video Moment Localization - Da Cao et al, `MM 2020`. [code]
- Reinforcement Learning for Weakly Supervised Temporal Grounding of Natural Language in Untrimmed Videos - Jie Wu et al, `MM 2020`.
- Language Guided Networks for Cross-modal Moment Retrieval - Kun Liu et al, `arxiv`.
Weakly Supervised:
- Weakly-Supervised Video Moment Retrieval via Semantic Completion Network - Zhijie Lin et al, `AAAI 2020`.
- VLANet: Video-Language Alignment Network for Weakly-Supervised Video Moment Retrieval - Minuk Ma et al, `ECCV 2020`.
- Two-Stream Consensus Network for Weakly-Supervised Temporal Action Localization - Yuanhao Zhai et al, `ECCV 2020`.
- Regularized Two-Branch Proposal Networks for Weakly-Supervised Moment Retrieval in Videos - Zhu Zhang et al, `MM 2020`. [code]
- Counterfactual Contrastive Learning for Weakly-Supervised Vision-Language Grounding - Zhu Zhang et al, `NeurIPS 2020`.
- Interaction-Integrated Network for Natural Language Moment Localization - Ke Ning et al, `TIP 2021`.
- Boundary Proposal Network for Two-Stage Natural Language Video Localization - Shaoning Xiao et al, `AAAI 2021`.
- Context-Aware Biaffine Localizing Network for Temporal Sentence Grounding - Liu et al, `CVPR 2021`.
- Multi-Modal Relational Graph for Cross-Modal Video Moment Retrieval - Zeng et al, `CVPR 2021`.
- Thinking Fast and Slow: Efficient Text-to-Visual Retrieval With Transformers - Miech et al, `CVPR 2021`.
- Fast Video Moment Retrieval - Gao et al, `ICCV 2021`.
- Hierarchical Deep Residual Reasoning for Temporal Moment Localization - Ma et al, `arxiv`.
Conferences to be updated:
- None