dense image captioning

An unofficial Torch implementation of J. Lu, C. Xiong, et al., "Knowing When to Look: Adaptive Attention via a Visual Sentinel for Image Captioning" (CVPR 2017), trained on the COCO image captioning and Flickr30k datasets.

The implementation differs from the paper in the following ways:

  • deformable adaptive attention (a sketch is given in the Introduction below);
  • a larger visual sentinel (128-dimensional);
  • model evaluation against the SPICE metric (a minimal example follows this list);
  • MCTS-based decoding.
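As a reference for the SPICE evaluation, below is a minimal sketch using the pip-installable pycocoevalcap package, which bundles the SPICE scorer and requires a Java runtime. The image id and captions are placeholders, not data or code from this repository.

```python
# Minimal SPICE evaluation sketch (assumes `pip install pycocoevalcap`
# and an available Java runtime for the SPICE scorer).
from pycocoevalcap.spice.spice import Spice

# Reference captions and generated candidates, keyed by image id
# (placeholder data, not taken from the COCO/Flickr30k splits used here).
gts = {1: ["a man riding a motorcycle on a dirt road",
           "a person on a motorbike in the countryside"]}
res = {1: ["a man rides a motorcycle down a dirt road"]}

scorer = Spice()
score, per_image = scorer.compute_score(gts, res)
print(f"SPICE: {score:.3f}")
```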

Introduction

Dense image captioning plays an important role in enabling visual-language understanding of the surrounding world.

In this project we propose a deformable variant of the adaptive attention with a visual sentinel introduced in the reference paper for estimating grounding probabilities. This variant allows larger networks to be constructed while running at faster inference speed and training for roughly half the epochs at equal performance.
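The paper itself does not describe a deformable variant, so the following PyTorch sketch only illustrates one way to combine the sentinel-gated adaptive attention with learned sampling offsets, in the spirit of deformable convolutions. The module name, dimensions, and the offset-prediction scheme are assumptions, not the repository's exact implementation.

```python
# Minimal sketch of deformable adaptive attention with a visual sentinel.
# All names, dimensions and the offset scheme are assumptions for illustration.
# Requires PyTorch >= 1.10 for the meshgrid `indexing` argument.
import torch
import torch.nn as nn
import torch.nn.functional as F


class DeformableAdaptiveAttention(nn.Module):
    """Sentinel-gated attention over a deformably resampled feature grid."""

    def __init__(self, feat_dim=512, hidden_dim=512, att_dim=512,
                 grid_size=7, sentinel_dim=128):
        super().__init__()
        self.grid_size = grid_size
        # Projections for spatial features, decoder state and the sentinel.
        self.w_v = nn.Linear(feat_dim, att_dim)
        self.w_h = nn.Linear(hidden_dim, att_dim)
        self.w_s = nn.Linear(sentinel_dim, att_dim)
        self.w_a = nn.Linear(att_dim, 1)
        # One (dx, dy) offset per spatial location, predicted from h_t.
        self.offset = nn.Linear(hidden_dim, 2 * grid_size * grid_size)
        # Map the (smaller) sentinel back to the feature dimension.
        self.s_proj = nn.Linear(sentinel_dim, feat_dim)

    def forward(self, feats, h, s):
        """feats: (B, C, H, W) CNN grid, h: (B, hidden), s: (B, sentinel)."""
        B, C, H, W = feats.shape
        assert H == self.grid_size and W == self.grid_size
        k = H * W

        # Deformable sampling: shift the regular grid by learned offsets and
        # bilinearly resample (grid_sample expects coordinates in [-1, 1]).
        ys, xs = torch.meshgrid(
            torch.linspace(-1.0, 1.0, H, device=feats.device),
            torch.linspace(-1.0, 1.0, W, device=feats.device),
            indexing="ij")
        base = torch.stack((xs, ys), dim=-1).unsqueeze(0).expand(B, H, W, 2)
        offsets = torch.tanh(self.offset(h)).view(B, H, W, 2) * (2.0 / H)
        sampled = F.grid_sample(feats, base + offsets, align_corners=True)
        V = sampled.flatten(2).transpose(1, 2)                    # (B, k, C)

        # Adaptive attention over the sampled features plus the sentinel.
        att_v = self.w_a(torch.tanh(self.w_v(V) + self.w_h(h).unsqueeze(1)))
        att_s = self.w_a(torch.tanh(self.w_s(s) + self.w_h(h))).unsqueeze(1)
        alpha = F.softmax(torch.cat([att_v, att_s], dim=1), dim=1)  # (B, k+1, 1)
        beta = alpha[:, -1]                                         # grounding prob.
        ctx = (alpha[:, :k] * V).sum(dim=1)
        return beta * self.s_proj(s) + (1.0 - beta) * ctx, alpha
```

In this sketch the sentinel weight beta (the last entry of the attention distribution) plays the role of the grounding probability: it decides how much of the context vector comes from the sentinel rather than from the resampled image features.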

This project is part of a larger effort to develop visual-language aid tools for visually impaired people, combining speech recognition, speech synthesis, image captioning and familiar-person identification.

For more information, see the attached in-depth report.

Training

The model was trained for 50 epochs on a multi-GPU HPC cluster courtesy of CERN.

Usage

The following files must be downloaded from Google Drive:

The former contains the dataset with COCO-like annotations and the corresponding vocabulary.
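As an illustration only, a COCO-style caption file and a pickled vocabulary could be loaded as sketched below; the file names and the assumption that the vocabulary unpickles to a word-to-index mapping are placeholders, since the actual contents depend on the Google Drive files.

```python
# Minimal loading sketch; "captions.json" and "vocab.pkl" are placeholder
# names, and the vocabulary is assumed to unpickle to a word-to-index dict.
import json
import pickle

with open("captions.json") as f:           # COCO-like annotation file
    annotations = json.load(f)["annotations"]

with open("vocab.pkl", "rb") as f:         # vocabulary used at training time
    vocab = pickle.load(f)

print(f"{len(annotations)} captions, {len(vocab)} vocabulary entries")
```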

The following files should be downloaded from Google Drive for display purposes:

N.B.: If the provided links are no longer available, contact the authors.

Authors