



This repository explores the possibility of generating multi-sentence image descriptions by leveraging the latent dependencies between the visual concepts in an image and their textual counterparts, optimized with a structured objective.


We combine information from the Visual Genome dataset with the dataset built for the paper A Hierarchical Approach for Generating Descriptive Image Paragraphs. More precisely, we use the bounding boxes and corresponding region descriptions from the former and the paragraph captions from the latter. The combined dataset is a list of 19561 objects, each with the following structure: {"id": int, "url": str, "paragraph": str, "regions": []}. The regions key holds the list of regions for that image, each in the following format: {"image_id": int, "region_id": int, "x": int, "y": int, "height": int, "width": int, "phrase": str}
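The join described above can be sketched in a few lines of Python. This is only an illustration, not the code in organize_data.py: the function name combine, the shapes of the three inputs (paragraph entries with image_id and paragraph keys, image metadata entries with image_id and url keys, and region groups with a regions list) and the choice to skip images that lack metadata or regions are all assumptions made for the example.

```python
from collections import defaultdict

# Keys kept for every region, matching the format described above.
REGION_KEYS = ("image_id", "region_id", "x", "y", "height", "width", "phrase")


def combine(paragraphs, images, region_groups):
    """Join paragraph captions with image URLs and regions by image id.

    All three input shapes are assumptions made for this sketch.
    """
    # Map image id -> URL from the image metadata.
    url_by_id = {img["image_id"]: img["url"] for img in images}

    # Map image id -> list of regions, keeping only the documented keys.
    regions_by_id = defaultdict(list)
    for group in region_groups:
        for region in group["regions"]:
            trimmed = {key: region[key] for key in REGION_KEYS}
            regions_by_id[region["image_id"]].append(trimmed)

    combined = []
    for para in paragraphs:
        image_id = para["image_id"]
        # Assumption: drop paragraphs whose image has no metadata or no regions.
        if image_id not in url_by_id or image_id not in regions_by_id:
            continue
        combined.append({
            "id": image_id,
            "url": url_by_id[image_id],
            "paragraph": para["paragraph"],
            "regions": regions_by_id[image_id],
        })
    return combined


# Toy inputs (made up for illustration, not taken from either dataset).
paragraphs = [
    {"image_id": 1, "paragraph": "A dog sits on grass. It looks at the camera."},
    {"image_id": 2, "paragraph": "This image has no regions in the toy data."},
]
images = [
    {"image_id": 1, "url": "https://example.org/1.jpg"},
    {"image_id": 2, "url": "https://example.org/2.jpg"},
]
region_groups = [
    {"id": 1, "regions": [
        {"image_id": 1, "region_id": 10, "x": 5, "y": 8, "height": 40,
         "width": 60, "phrase": "a dog on grass", "extra": "dropped"},
    ]},
]

dataset = combine(paragraphs, images, region_groups)
```

With these toy inputs, dataset holds a single record for image 1, since image 2 has no regions and is skipped under the stated assumption.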


To build the dataset for this project, there are some prerequisites that need to be satisfied:

  1. Download the metadata for the dataset used in the paper A Hierarchical Approach for Generating Descriptive Image Paragraphs and unzip it into your data directory on your local drive.

  2. Download the image metadata and region descriptions from the Visual Genome dataset and unzip them into your data directory on your local drive.

  3. Set up a config directory at the same level as your data directory and create a resources.py file in it with the following variables: data_path, VG_IMG_FNAME, VG_REG_FNAME and PARA_FNAME, which represent the data directory and the filenames of the unzipped Visual Genome image metadata, region descriptions and paragraph dataset respectively (without the .json extension).

  4. Run the organize_data.py file as follows: python organize_data.py --op align_data

You will find the final dataset in the combined_data.json file under your data directory.
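As a rough illustration of step 3 and of reading the result, the sketch below shows what config/resources.py might contain and how the generated file could be loaded and sanity-checked. The four variable names and combined_data.json come from the steps above; every value assigned to them is a placeholder to replace with your own paths and filenames, and the helpers json_path and load_combined are hypothetical names that are not part of the repository.

```python
import json
import os

# Possible contents of config/resources.py. All values are placeholders.
data_path = "data"                    # directory holding the unzipped files
VG_REG_FNAME = "region_descriptions"  # region descriptions, without ".json"
VG_IMG_FNAME = "image_data"           # image metadata, without ".json"
PARA_FNAME = "paragraphs_v1"          # paragraph dataset, without ".json"

# Top-level keys every combined record is expected to carry.
RECORD_KEYS = {"id", "url", "paragraph", "regions"}


def json_path(fname, base=data_path):
    """Build the path of a JSON file from its extension-less name."""
    return os.path.join(base, fname + ".json")


def load_combined(base=data_path):
    """Load combined_data.json and check each record's top-level keys."""
    with open(json_path("combined_data", base), encoding="utf-8") as f:
        records = json.load(f)
    for record in records:
        missing = RECORD_KEYS - record.keys()
        if missing:
            raise ValueError(f"record {record.get('id')} lacks {sorted(missing)}")
    return records
```

Calling load_combined() after running organize_data.py should return the list of records. It raises ValueError if a record lacks one of the four top-level keys.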


Krause, Jonathan; Johnson, Justin; Krishna, Ranjay; Fei-Fei, Li. A Hierarchical Approach for Generating Descriptive Image Paragraphs. In Computer Vision and Pattern Recognition (CVPR).

Krishna, Ranjay; Zhu, Yuke; Groth, Oliver; Johnson, Justin; Hata, Kenji; Kravitz, Joshua; Chen, Stephanie; Kalantidis, Yannis; Li, Li-Jia; Shamma, David A; Bernstein, Michael; Fei-Fei, Li. Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations. 2016.