Vision with Referring Expressions
howardyclo opened this issue · 0 comments
Vision with Referring Expressions (Last Update Date: 2019/03/06)
A curated list of deep learning papers on computer vision with referring natural language expressions. This line of research is also related to image captioning, visual question answering, multimodal grounding for language, and multimodal machine learning.
### Survey
- From Image to Language and Back Again by Belz et al. Natural Language Engineering 2018.
### Dataset
- ReferItGame: Referring to Objects in Photographs of Natural Scenes by Kazemzadeh et al. EMNLP 2014.
- Generation and Comprehension of Unambiguous Object Descriptions by Mao et al. 2015/11. CVPR 2016.
### Detection
- Generation and Comprehension of Unambiguous Object Descriptions by Mao et al. 2015/11. CVPR 2016.
- Natural Language Object Retrieval by Hu et al. 2015/11. CVPR 2016.
- Referring Expression Generation and Comprehension via Attributes by Liu et al. ICCV 2017.
- A Joint Speaker-Listener-Reinforcer Model for Referring Expressions by Yu et al. 2016/11. CVPR 2017.
- Modeling Relationships in Referential Expressions with Compositional Modular Networks by Hu et al. 2016/11. CVPR 2017.
- Object Referring in Videos with Language and Human Gaze by Vasudevan et al. 2018/01. CVPR 2018.
- MAttNet: Modular Attention Network for Referring Expression Comprehension by Yu et al. 2018/01. CVPR 2018.
- Visual Reasoning with Multi-hop Feature Modulation by Strub et al. 2018/08. ECCV 2018.
- Explainability by Parsing: Neural Module Tree Networks for Natural Language Visual Grounding by Liu et al. 2018/12.
- Neighbourhood Watch: Referring Expression Comprehension via Language-guided Graph Attention Networks by Wang et al. 2018/12.
### Tracking
- Tracking by Natural Language Specification by Li et al. CVPR 2017.
- Describe and Attend to Track: Learning Natural Language guided Structural Representation and Visual Attention for Object Tracking by Wang et al. 2018/11.
### Moment Localization
- TALL: Temporal Activity Localization via Language Query by Gao et al. 2017/05. ICCV 2017.
- Localizing Moments in Video with Natural Language by Hendricks et al. 2017/08. ICCV 2017.
- Cross-modal Moment Localization in Videos by Liu et al. ACM Multimedia 2018.
- Multilevel Language and Vision Integration for Text-to-Clip Retrieval by Xu et al. 2018/04. AAAI 2019.
- Attentive Sequence to Sequence Translation for Localizing Clips of Interest by Natural Language Descriptions by Ning et al. 2018/08.
- Attentive Moment Retrieval in Videos by Liu et al. SIGIR 2018.
- Localizing Moments in Video with Temporal Language by Hendricks et al. 2018/09. EMNLP 2018.
- Temporal Modular Networks for Retrieving Complex Compositional Activities in Videos by Liu et al. ECCV 2018.
- MAC: Mining Activity Concepts for Language-based Temporal Localization by Ge et al. 2018/11.
- MAN: Moment Alignment Network for Natural Language Moment Retrieval via Iterative Graph Adjustment by Zhang et al. 2018/12.
### Segmentation
- Segmentation from Natural Language Expressions by Hu et al. 2016/03. ECCV 2016.
- Utilizing Large Scale Vision and Text Datasets for Image Segmentation from Referring Expressions by Hu et al. 2016/08.
- Recurrent Multimodal Interaction for Referring Image Segmentation by Liu et al. 2017/03. ICCV 2017.
- VQS: Linking Segmentations to Questions and Answers for Supervised Attention in VQA and Question-Focused Semantic Segmentation by Gan et al. ICCV 2017.
- Key-Word-Aware Network for Referring Expression Image Segmentation by Shi et al. ECCV 2018.
- Referring Image Segmentation via Recurrent Refinement Networks by Li et al. CVPR 2018.
- MAttNet: Modular Attention Network for Referring Expression Comprehension by Yu et al. 2018/01. CVPR 2018.
- Guide Me: Interacting with Deep Networks by Rupprecht et al. 2018/03. CVPR 2018.
- Video Object Segmentation with Language Referring Expressions by Khoreva et al. 2018/03.
- Dynamic Multimodal Instance Segmentation guided by natural language queries by Margffoy-Tuay et al. 2018/07. ECCV 2018.
### Grounding
- Grounding of Textual Phrases in Images by Reconstruction by Rohrbach et al. 2015/11. ICCV 2016.
- Modeling Context in Referring Expressions by Yu et al. 2016/08. ECCV 2016.
- Modeling Relationships in Referential Expressions with Compositional Modular Networks by Hu et al. 2016/11. CVPR 2017.
- Grounding Referring Expressions in Images by Variational Context by Zhang et al. 2017/12. CVPR 2018.
- Finding "It": Weakly-Supervised Reference-Aware Visual Grounding in Instructional Videos by Huang et al. CVPR 2018.
- Weakly-Supervised Video Object Grounding from Text by Loss Weighting and Object Interaction by Zhou et al. 2018/05. BMVC 2018.
- Temporally Grounding Natural Sentence in Video by Chen et al. EMNLP 2018.
### Diagnosing
- Visual Referring Expression Recognition: What Do Systems Actually Learn? by Cirik et al. 2018/05. NAACL 2018.
- CLEVR-Ref+: Diagnosing Visual Reasoning with Referring Expressions [Blog] by Liu et al. 2019/01.