
awesome grounding: A curated list of research papers in visual grounding

Awesome Visual Grounding

A curated list of research papers in grounding. Links to code are included where available.

Have a look at SCOPE.md to get familiar with what grounding means and the tasks considered in this repository.

To maintain the quality of the repo, I have gone through every listed paper at least once before adding it, to ensure its relevance to grounding. However, I might have missed some paper(s) or added some irrelevant paper(s). Feel free to open an issue in that case. I will go through the paper and then add / remove it.

Table of Contents


Feel free to contact me via email (ark.sadhu2904@gmail.com), open an issue, or submit a pull request. To add a new paper via pull request:

  1. Fork the repo and edit the README. Put the new paper under the correct heading, and place it at the correct chronological position.
  2. Copy its reference in MLA format.
  3. Put ** around the title.
  4. Provide a link to the paper (arxiv/semantic scholar/conference proceedings).
  5. If code or a website exists, link that too.
  6. Send a pull request. Ideally, I will review the request within a week.
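
For illustration, steps 2 to 5 can be sketched as a small Python helper that assembles one entry. This is only a sketch: the function name `format_entry`, the `[[Paper]]` / `[[Code]]` link labels, and the exact punctuation are assumptions, and every value in the example is a placeholder rather than a real paper. Match the entries already in the README if they differ.

```python
def format_entry(authors, title, venue, year, paper_url, code_url=None):
    """Build one README list entry (hypothetical layout, not an official format)."""
    # MLA-style reference with the title wrapped in ** (steps 2 and 3).
    entry = f"{authors}. **{title}.** {venue}, {year}."
    # Link to the paper (step 4).
    entry += f" [[Paper]]({paper_url})"
    # Link to code or a website only when one exists (step 5).
    if code_url:
        entry += f" [[Code]]({code_url})"
    return entry


# Placeholder values only; replace them with the real reference.
example = format_entry(
    authors="Lastname, Firstname, et al",
    title="Placeholder Paper Title",
    venue="Placeholder Venue",
    year=2020,
    paper_url="https://example.com/paper",
    code_url="https://example.com/code",
)
print(example)
```

The printed line can then be pasted under the matching heading at its chronological position.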


  1. MATTNet demo: http://vision2.cs.unc.edu/refer/comprehension

Other Compilations:

Shoutout to some other awesome stuff on vision and language grounding:

  1. Multi-modal Reading List by Paul Liang (@pliang279): https://github.com/pliang279/awesome-multimodal-ml/
  2. Temporal Grounding by Mu Ketong (@iworldtong): https://github.com/iworldtong/Awesome-Grounding-Natural-Language-in-Video
  3. Temporal Grounding by WuJie (@WuJie1010): https://github.com/WuJie1010/Awesome-Temporally-Language-Grounding. Also, check out their implementation of some of the popular papers: https://github.com/WuJie1010/Temporally-language-grounding


Image Grounding Datasets

Video Datasets

Embodied Agents Platforms:

Paper Roadmap (Chronological Order):

Visual Grounding / Referring Expressions (Images):

Natural Language Object Retrieval (Images)

Grounding Relations / Referring Relations

    • Critique of Referring Relationship paper

Video Grounding (Activity Localization) using Natural Language:

Grounded Description (Image) (WIP)

Grounded Description (Video) (WIP)

Visual Grounding Pretraining

Grounding for Embodied Agents (WIP):

