Awesome Visual Grounding

A curated list of research papers in grounding. Link to the code if available is also present.

Have a look at SCOPE.md to get familiar with what grounding means and the tasks considered in this repository.

To maintaing the quality of the repo, I have gone through all the listed papers at least once before adding them to ensure their relevance to grounding. However, I might have missed some paper(s) or added some irrelevant paper(s). Feel free to open an issue in that case. I will go through the paper and then add / remove it.

Contributing
Demos
Other Compilations
Datasets
- Image Grounding Datasets
- Video Datasets
Paper Roadmap (Chronological Order):

Contributing

Feel free to contact me via email (ark.sadhu2904@gmail.com) or open an issue or submit a pull request. To add a new paper via pull request:

Fork the repo, change readme. Put the new paper under the correct heading, and place it at the correct chronological position.
Copy its reference in MLA format
Put ** around the title
Provide link to the paper (arxiv/semantic scholar/conference proceedings).
If code or website exists, link that too.
Send a pull request. Ideally, I will review the request within a week.

Demos

MATTNet demo: http://vision2.cs.unc.edu/refer/comprehension

Other Compilations:

Shoutout to some other awesome stuff on vision and language grounding:

Multi-modal Reading List by Paul Liang (@pliang279) : https://github.com/pliang279/awesome-multimodal-ml/
Temporal Grounding by Mu Ketong (@iworldtong): https://github.com/iworldtong/Awesome-Grounding-Natural-Language-in-Video
Temporal Grounding by WuJie (@WuJie1010): https://github.com/WuJie1010/Awesome-Temporally-Language-Grounding. Also, checkout their implementation of some of the popular papers: https://github.com/WuJie1010/Temporally-language-grounding

Datasets

Image Grounding Datasets

Flickr30k: Plummer, Bryan A., et al. Flickr30k entities: Collecting region-to-phrase correspondences for richer image-to-sentence models. Proceedings of the IEEE international conference on computer vision. 2015. [Paper] [Code] [Website]
RefClef: Kazemzadeh, Sahar, et al. Referitgame: Referring to objects in photographs of natural scenes. Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP). 2014. [Paper] [Website]
RefCOCOg: Mao, Junhua, et al. Generation and comprehension of unambiguous object descriptions. Proceedings of the IEEE conference on computer vision and pattern recognition. 2016. [Paper] [Code]
Visual Genome: Krishna, Ranjay, et al. Visual genome: Connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision 123.1 (2017): 32-73. [Paper] [Website]
RefCOCO and RefCOCO+: 1. Yu, Licheng, et al. Modeling context in referring expressions. European Conference on Computer Vision. Springer, Cham, 2016. [Paper][Code]
GuessWhat: De Vries, Harm, et al. Guesswhat?! visual object discovery through multi-modal dialogue. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017. [Paper] [Code] [Website]
Clevr-ref+: Liu, Runtao, et al. Clevr-ref+: Diagnosing visual reasoning with referring expressions. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019. [Paper] [Code] [Website]

Video Datasets

TaCoS: Regneri, Michaela, et al. Grounding action descriptions in videos. Transactions of the Association of Computational Linguistics 1 (2013): 25-36. [Paper] [Website]
Charades: Sigurdsson, Gunnar A., et al. Hollywood in homes: Crowdsourcing data collection for activity understanding. European Conference on Computer Vision. Springer, Cham, 2016. [Paper] [Website]
Charades-STA: Gao, Jiyang, et al. Tall: Temporal activity localization via language query. arXiv preprint arXiv:1705.02101 (2017).[Paper] [Code]
Distinct Describable Moments (DiDeMo): Hendricks, Lisa Anne, et al. Localizing moments in video with natural language. Proceedings of the IEEE International Conference on Computer Vision (ICCV). 2017. Method name: MCN [Paper] [Code] [Website]
ActivityNet Captions: Krishna, Ranjay, et al. Dense-captioning events in videos. Proceedings of the IEEE International Conference on Computer Vision. 2017. [Paper] [Website]
Charades-Ego: [Website]
- Sigurdsson, Gunnar, et al. Actor and Observer: Joint Modeling of First and Third-Person Videos. CVPR-IEEE Conference on Computer Vision & Pattern Recognition. 2018. [Paper] [Code]
- Sigurdsson, Gunnar A., et al. "Charades-Ego: A Large-Scale Dataset of Paired Third and First Person Videos." arXiv preprint arXiv:1804.09626 (2018). [Paper] [Code]
TEMPO: Hendricks, Lisa Anne, et al. Localizing Moments in Video with Temporal Language. arXiv preprint arXiv:1809.01337 (2018). [Paper] [Code] [Website]
ActivityNet-Entities: Zhou, Luowei, et al. Grounded video description. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019. [Paper] [Code]

Embodied Agents Platforms:

Matterport3D: Chang, Angel, et al. Matterport3d: Learning from rgb-d data in indoor environments. arXiv preprint arXiv:1709.06158 (2017). [Paper] [Code] [Website]
- Photorealistic rooms
AI2-THOR: Kolve, Eric, et al. Ai2-thor: An interactive 3d environment for visual ai. arXiv preprint arXiv:1712.05474 (2017). [Paper] [Website]
- Actionable objects!
Habitat AI: Savva, Manolis, et al. Habitat: A platform for embodied ai research. Proceedings of the IEEE International Conference on Computer Vision. 2019. (ICCV 2019) [Paper] [Website]

Paper Roadmap (Chronological Order):

Visual Grounding / Referring Expressions (Images):

Karpathy, Andrej, Armand Joulin, and Li F. Fei-Fei. Deep fragment embeddings for bidirectional image sentence mapping. Advances in neural information processing systems. 2014. [Paper]
Karpathy, Andrej, and Li Fei-Fei. Deep visual-semantic alignments for generating image descriptions. Proceedings of the IEEE conference on computer vision and pattern recognition. 2015. Method name: Neural Talk. [Paper] [Code] [Torch Code] [Website]
Hu, Ronghang, et al. Natural language object retrieval. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016. Method name: Spatial Context Recurrent ConvNet (SCRC) [Paper] [Code] [Website]
Mao, Junhua, et al. Generation and comprehension of unambiguous object descriptions. Proceedings of the IEEE conference on computer vision and pattern recognition. 2016. [Paper] [Code]
Wang, Liwei, Yin Li, and Svetlana Lazebnik. Learning deep structure-preserving image-text embeddings. Proceedings of the IEEE conference on computer vision and pattern recognition. 2016. [Paper] [Code]
Yu, Licheng, et al. Modeling context in referring expressions. European Conference on Computer Vision. Springer, Cham, 2016. [Paper][Code]
Nagaraja, Varun K., Vlad I. Morariu, and Larry S. Davis. Modeling context between objects for referring expression understanding. European Conference on Computer Vision. Springer, Cham, 2016.[Paper] [Code]
Rohrbach, Anna, et al. Grounding of textual phrases in images by reconstruction. European Conference on Computer Vision. Springer, Cham, 2016. Method Name: GroundR [Paper] [Tensorflow Code] [Torch Code]
Wang, Mingzhe, et al. Structured matching for phrase localization. European Conference on Computer Vision. Springer, Cham, 2016. Method name: Structured Matching [Paper] [Code]
Hu, Ronghang, Marcus Rohrbach, and Trevor Darrell. Segmentation from natural language expressions. European Conference on Computer Vision. Springer, Cham, 2016. [Paper] [Code] [Website]
Fukui, Akira et al. Multimodal Compact Bilinear Pooling for Visual Question Answering and Visual Grounding. EMNLP (2016). Method name: MCB [Paper][Code]
Endo, Ko, et al. An attention-based regression model for grounding textual phrases in images. Proc. IJCAI. 2017. [Paper]
Chen, Kan, et al. MSRC: Multimodal spatial regression with semantic context for phrase grounding. International Journal of Multimedia Information Retrieval 7.1 (2018): 17-28. [Paper -Springer Link]
Wu, Fan et al. An End-to-End Approach to Natural Language Object Retrieval via Context-Aware Deep Reinforcement Learning. CoRR abs/1703.07579 (2017): n. pag. [Paper] [Code]
Yu, Licheng, et al. A joint speakerlistener-reinforcer model for referring expressions. Computer Vision and Pattern Recognition (CVPR). Vol. 2. 2017. [Paper] [Code][Website]
Hu, Ronghang, et al. Modeling relationships in referential expressions with compositional modular networks. Computer Vision and Pattern Recognition (CVPR), 2017 IEEE Conference on. IEEE, 2017. [Paper] [Code]
Luo, Ruotian, and Gregory Shakhnarovich. Comprehension-guided referring expressions. Computer Vision and Pattern Recognition (CVPR). Vol. 2. 2017. [Paper] [Code]
Liu, Jingyu, Liang Wang, and Ming-Hsuan Yang. Referring expression generation and comprehension via attributes. Proceedings of CVPR. 2017. [Paper]
Xiao, Fanyi, Leonid Sigal, and Yong Jae Lee. Weakly-supervised visual grounding of phrases with linguistic structures. arXiv preprint arXiv:1705.01371 (2017). [Paper]
Plummer, Bryan A., et al. Phrase localization and visual relationship detection with comprehensive image-language cues. Proc. ICCV. 2017. [Paper] [Code]
Chen, Kan, Rama Kovvuri, and Ram Nevatia. Query-guided regression network with context policy for phrase grounding. Proceedings of the IEEE International Conference on Computer Vision (ICCV). 2017. Method name: QRC [Paper] [Code]
Liu, Chenxi, et al. Recurrent Multimodal Interaction for Referring Image Segmentation. ICCV. 2017. [Paper] [Code]
Li, Jianan, et al. Deep attribute-preserving metric learning for natural language object retrieval. Proceedings of the 2017 ACM on Multimedia Conference. ACM, 2017. [Paper: ACM Link]
Li, Xiangyang, and Shuqiang Jiang. Bundled Object Context for Referring Expressions. IEEE Transactions on Multimedia (2018). [Paper ieee link]
Yu, Zhou, et al. Rethinking Diversified and Discriminative Proposal Generation for Visual Grounding. arXiv preprint arXiv:1805.03508 (2018). [Paper] [Code]
Yu, Licheng, et al. Mattnet: Modular attention network for referring expression comprehension. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2018. [Paper] [Code] [Website]
Deng, Chaorui, et al. Visual Grounding via Accumulated Attention. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018.[Paper]
Li, Ruiyu, et al. Referring image segmentation via recurrent refinement networks. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018.[Paper] [Code]
Zhang, Yundong, Juan Carlos Niebles, and Alvaro Soto. Interpretable Visual Question Answering by Visual Grounding from Attention Supervision Mining. arXiv preprint arXiv:1808.00265 (2018). [Paper]
Chen, Kan, Jiyang Gao, and Ram Nevatia. Knowledge aided consistency for weakly supervised phrase grounding. arXiv preprint arXiv:1803.03879 (2018). [Paper] [Code]
Zhang, Hanwang, Yulei Niu, and Shih-Fu Chang. Grounding referring expressions in images by variational context. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018. [Paper] [Code]
Cirik, Volkan, Taylor Berg-Kirkpatrick, and Louis-Philippe Morency. Using syntax to ground referring expressions in natural images. arXiv preprint arXiv:1805.10547 (2018).[Paper] [Code]
Margffoy-Tuay, Edgar, et al. Dynamic multimodal instance segmentation guided by natural language queries. Proceedings of the European Conference on Computer Vision (ECCV). 2018. [Paper] [Code]
Shi, Hengcan, et al. Key-word-aware network for referring expression image segmentation. Proceedings of the European Conference on Computer Vision (ECCV). 2018.[Paper] [Code]
Plummer, Bryan A., et al. Conditional image-text embedding networks. Proceedings of the European Conference on Computer Vision (ECCV). 2018. [Paper] [Code]
Akbari, Hassan, et al. Multi-level Multimodal Common Semantic Space for Image-Phrase Grounding. arXiv preprint arXiv:1811.11683 (2018). [Paper]
Kovvuri, Rama, and Ram Nevatia. PIRC Net: Using Proposal Indexing, Relationships and Context for Phrase Grounding. arXiv preprint arXiv:1812.03213 (2018). [Paper]
Liu, Daqing, et al. Learning to Assemble Neural Module Tree Networks for Visual Grounding. Proceedings of the IEEE International Conference on Computer Vision (ICCV). 2019. [Paper] [Code]
Chen, Xinpeng, et al. Real-Time Referring Expression Comprehension by Single-Stage Grounding Network. arXiv preprint arXiv:1812.03426 (2018). [Paper]
Wang, Peng, et al. Neighbourhood Watch: Referring Expression Comprehension via Language-guided Graph Attention Networks. arXiv preprint arXiv:1812.04794 (2018). [Paper]
RETRACTED (see #2): Deng, Chaorui, et al. You Only Look & Listen Once: Towards Fast and Accurate Visual Grounding. arXiv preprint arXiv:1902.04213 (2019). [Paper]
Hong, Richang, et al. Learning to Compose and Reason with Language Tree Structures for Visual Grounding. IEEE Transactions on Pattern Analysis and Machine Intelligence (T-PAMI). 2019. [Paper]
Liu, Xihui, et al. Improving Referring Expression Grounding with Cross-modal Attention-guided Erasing. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019. [Paper]
Dogan, Pelin, Leonid Sigal, and Markus Gross. Neural Sequential Phrase Grounding (SeqGROUND). Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (CVPR) 2019. [Paper]
Datta, Samyak, et al. Align2ground: Weakly supervised phrase grounding guided by image-caption alignment. arXiv preprint arXiv:1903.11649 (2019). (ICCV 2019) [Paper]
Fang, Zhiyuan, et al. Modularized textual grounding for counterfactual resilience. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (CVPR) 2019. [Paper]
Ye, Linwei, et al. Cross-Modal Self-Attention Network for Referring Image Segmentation. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (CVPR) 2019. [Paper]
Yang, Sibei, Guanbin Li, and Yizhou Yu. Cross-Modal Relationship Inference for Grounding Referring Expressions. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (CVPR) 2019. [Paper]
Yang, Sibei, Guanbin Li, and Yizhou Yu. Dynamic Graph Attention for Referring Expression Comprehension. arXiv preprint arXiv:1909.08164 (2019). (ICCV 2019) [Paper] [Code]
Wang, Josiah, and Lucia Specia. "Phrase Localization Without Paired Training Examples." arXiv preprint arXiv:1908.07553 (2019). (ICCV 2019) [Paper] [Code]
Yang, Zhengyuan, et al. A Fast and Accurate One-Stage Approach to Visual Grounding. arXiv preprint arXiv:1908.06354 (2019). (ICCV 2019) [Paper] [Code]
Sadhu, Arka, Kan Chen, and Ram Nevatia. Zero-Shot Grounding of Objects from Natural Language Queries. arXiv preprint arXiv:1908.07129 (2019).(ICCV 2019) [Paper] [Code] Disclaimer: I am an author of the paper
Liu, Xuejing, et al. Adaptive Reconstruction Network for Weakly Supervised Referring Expression Grounding. arXiv preprint arXiv:1908.10568 (2019). (ICCV 2019) [Paper] [Code]
Chen, Yi-Wen, et al. Referring Expression Object Segmentation with Caption-Aware Consistency. arXiv preprint arXiv:1910.04748 (2019). (BMVC 2019) [Paper] [Code]
Liu, Jiacheng, and Julia Hockenmaier. Phrase Grounding by Soft-Label Chain Conditional Random Field. arXiv preprint arXiv:1909.00301 (2019) (EMNLP 2019). [Paper] [Code]
Liu, Yongfei, Wan Bo, Zhu Xiaodan and He Xuming. Learning Cross-modal Context Graph for Visual Grounding. arXiv preprint arXiv: (2019) (AAAI-2020). [Paper] [Code]

Natural Language Object Retrieval (Images)

Guadarrama, Sergio, et al. Open-vocabulary Object Retrieval. Robotics: science and systems. Vol. 2. No. 5. 2014. [Paper] [Code]
Hu, Ronghang, et al. Natural language object retrieval. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016. Method name: Spatial Context Recurrent ConvNet (SCRC) [Paper] [Code] [Website]
Wu, Fan et al. An End-to-End Approach to Natural Language Object Retrieval via Context-Aware Deep Reinforcement Learning. CoRR abs/1703.07579 (2017): n. pag. [Paper] [Code]
Li, Jianan, et al. Deep attribute-preserving metric learning for natural language object retrieval. Proceedings of the 2017 ACM on Multimedia Conference. ACM, 2017. [Paper: ACM Link]
Nguyen, Anh, et al. Object Captioning and Retrieval with Natural Language. arXiv preprint arXiv:1803.06152 (2018). [Paper] [Website]
Plummer, Bryan A., et al. Open-vocabulary Phrase Detection. arXiv preprint arXiv:1811.07212 (2018). [Paper] [Code]

Grounding Relations / Referring Relations

Krishna, Ranjay, et al. Referring relationships. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018. [Paper] [Code] [Website]
Raboh, Moshiko et al. Differentiable Scene Graphs. (2019). [Paper]
Conser, Erik, et al. Revisiting Visual Grounding. arXiv preprint arXiv:1904.02225 (2019). [Paper]
- Critique of Referring Relationship paper

Video Grounding (Activity Localization) using Natural Language:

Yu, Haonan, et al. Grounded Language Learning from Video Described with Sentences Proceedings of the Annual Meeting of the Association for Computational Linguistics. 2013. [Paper]
Xu, Ran, et al. Jointly Modeling Deep Video and Compositional Text to Bridge Vision and Language in a Unified Framework. Proceedings of the AAAI Conference on Artificial Intelligence. 2015. [Paper]
Song, Young Chol, et al. Unsupervised Alignment of Actions in Video with Text Descriptions Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI). 2016. [Paper]
Gao, Jiyang, et al. Tall: Temporal activity localization via language query. arXiv preprint arXiv:1705.02101 (2017). Method name: TALL [Paper] [Code]
Hendricks, Lisa Anne, et al. Localizing moments in video with natural language. Proceedings of the IEEE International Conference on Computer Vision (ICCV). 2017. Method name: MCN [Paper] [Code]
Khoreva, Anna, Anna Rohrbach, and Bernt Schiele. Video Object Segmentation with Language Referring Expressions. arXiv preprint arXiv:1803.08006 (2018). [Paper] [Website]
Xu, Huijuan, et al. Joint Event Detection and Description in Continuous Video Streams. arXiv preprint arXiv:1802.10250 (2018). [Paper] [Code]
Xu, Huijuan, et al. Text-to-Clip Video Retrieval with Early Fusion and Re-Captioning. arXiv preprint arXiv:1804.05113 (2018). [Paper] [Code]
Liu, Bingbin, et al. Temporal Modular Networks for Retrieving Complex Compositional Activities in Videos. European Conference on Computer Vision. Springer, Cham, 2018. [Paper] [Website]
Liu, Meng, et al. Attentive Moment Retrieval in Videos. Proceedings of the International ACM SIGIR Conference . 2018. [Paper] [Website]
Chen, Jingyuan, et al. Temporally grounding natural sentence in video. Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. 2018. [Paper]
Hendricks, Lisa Anne, et al. Localizing Moments in Video with Temporal Language. arXiv preprint arXiv:1809.01337 (2018). [Paper] [Code] [Website]
Wu, Aming, and Yahong Han. Multi-modal Circulant Fusion for Video-to-Language and Backward. IJCAI. Vol. 3. No. 4. 2018. [Paper] [Code]
Ge, Runzhou, et al. MAC: Mining Actiivity Concepts for Language-based Temporal Localization. arXiv preprint arXiv:1811.08925 (2018). [Paper] [Code]
Zhang, Da, et al. MAN: Moment Alignment Network for Natural Language Moment Retrieval via Iterative Graph Adjustment. arXiv preprint arXiv:1812.00087 (2018). [Paper]
He, Dongliang, et al. Read, Watch, and Move: Reinforcement Learning for Temporally Grounding Natural Language Descriptions in Videos. Proceedings of the AAAI Conference on Artificial Intelligence. 2019. [Paper]
Wang, Weining, Yan Huang, and Liang Wang. Language-Driven Temporal Activity Localization: A Semantic Matching Reinforcement Learning Model. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019. [Paper]
Ghosh, Soham, et al. ExCL: Extractive Clip Localization Using Natural Language Descriptions. arXiv preprint arXiv:1904.02755 (2019). [Paper]
Chen, Shaoxiang, and Yu-Gang Jiang. Semantic Proposal For Activity Localization In Videos Via Sentence Query. Proceedings of the AAAI Conference on Artificial Intelligence. 2019.[Paper]
Yuan Y, Mei T, Zhu W. To Find Where You Talk: Temporal Sentence Localization In Video With Attention Based Location Regression. Proceedings of the AAAI Conference on Artificial Intelligence. 2019, 33: 9159-9166. [Paper]
Mithun N C, Paul S, Roy-Chowdhury A K. Weakly Supervised Video Moment Retrieval From Text Queries. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019: 11592-11601.[Paper]
Escorcia, Victor, et al. Temporal Localization of Moments in Video Collections with Natural Language. arXiv preprint arXiv:1907.12763 (2019). (ICCV 2019) [Paper] [Code]
Wang, Jingwen, Lin Ma, and Wenhao Jiang. Temporally Grounding Language Queries in Videos by Contextual Boundary-aware Prediction. arXiv preprint arXiv:1909.05010 (2019). (AAAI 2020) [Paper] [Code]

Grounded Description (Image) (WIP)

Hendricks, Lisa Anne, et al. Generating visual explanations. European Conference on Computer Vision. Springer, Cham, 2016. [Paper] [Code] [Pytorch Code]
Jiang, Ming, et al. TIGEr: Text-to-Image Grounding for Image Caption Evaluation. arXiv preprint arXiv:1909.02050 (2019). (EMNLP 2019) [Paper] [Code]
Lee, Jason, Kyunghyun Cho, and Douwe Kiela. Countering language drift via visual grounding. arXiv preprint arXiv:1909.04499 (2019). (EMNLP 2019) [Paper]

Grounded Description (Video) (WIP)

Ma, Chih-Yao, et al. Grounded Objects and Interactions for Video Captioning. arXiv preprint arXiv:1711.06354 (2017). [Paper]
Zhou, Luowei, et al. Grounded video description. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019. [Paper] [Code]

Visual Grounding Pretraining

Sun, Chen, et al. Videobert: A joint model for video and language representation learning. arXiv preprint arXiv:1904.01766 (2019). [Paper]
Jiasen Lu, Dhruv Batra, Devi Parikh, Stefan Lee. ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks. arXiv preprint arXiv:1908.02265 (Neurips 2019) [Paper] [Code]
Li, Liunian Harold, et al. VisualBERT: A Simple and Performant Baseline for Vision and Language. arXiv preprint arXiv:1908.03557 (2019). [Paper] [Code]
Li, Gen, et al. Unicoder-vl: A universal encoder for vision and language by cross-modal pre-training. arXiv preprint arXiv:1908.06066 (2019). [Paper]
Tan, Hao, and Mohit Bansal. Lxmert: Learning cross-modality encoder representations from transformers. arXiv preprint arXiv:1908.07490 (2019). [Paper] [Code]
Su, Weijie, et al. Vl-bert: Pre-training of generic visual-linguistic representations. arXiv preprint arXiv:1908.08530 (2019). [Paper]
Chen, Yen-Chun, et al. UNITER: Learning UNiversal Image-TExt Representations. arXiv preprint arXiv:1909.11740 (2019). [Paper]

Grounding for Embodied Agents (WIP):

Shridhar, Mohit, et al. ALFRED: A Benchmark for Interpreting Grounded Instructions for Everyday Tasks. arXiv preprint arXiv:1912.01734 (2019). [Paper] [Code] [Website]

Misc:

Han, Xudong, Philip Schulz, and Trevor Cohn. Grounding learning of modifier dynamics: An application to color naming. arXiv preprint arXiv:1909.07586 (2019). (EMNLP 2019) [Paper] [Code]
Yu, Xintong, et al. What You See is What You Get: Visual Pronoun Coreference Resolution in Dialogues. arXiv preprint arXiv:1909.00421 (2019). (EMNLP 2019) [Paper] [Code]

SikaStar/awesome-grounding