This is the PyTorch implementation of "Are scene graphs good enough to improve Image Captioning?". Training and evaluation are done on the MSCOCO image captioning challenge dataset. Bottom-up features for the MSCOCO dataset are extracted using a Faster R-CNN object detection model trained on the Visual Genome dataset. Pretrained bottom-up features are downloaded from here.
This repository is organized with each model design in a separate branch; the branch name indicates the model design. It is best to avoid the Main branch currently, since it is outdated.
TODO: Clean up the codebase so it is contained in a single main branch. This is planned for the summer holidays.
Create a folder called 'data'
Create a folder called 'final_dataset'
Download the MSCOCO Training (13GB) and Validation (6GB) images.
Also download Andrej Karpathy's training, validation, and test splits. This zip file contains the captions.
Unzip all files and place the folders in the 'data' folder.
Next, download the bottom-up image features. We used the fixed 36 regions version.
Unzip the folder and place the unzipped folder in the 'bottom-up_features' folder.
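As a quick sanity check before running the preprocessing scripts, a minimal sketch like the one below can confirm the expected layout. The top-level folder names come from this README; the names of the unzipped MSCOCO and Karpathy files (train2014, val2014, dataset_coco.json) are assumptions based on the standard downloads, so adjust them if yours differ.

```python
import os

# Folder names from this README; the unzipped names inside 'data' are the
# standard MSCOCO / Karpathy-split names and may need adjusting.
expected = [
    'data/train2014',          # MSCOCO training images
    'data/val2014',            # MSCOCO validation images
    'data/dataset_coco.json',  # Karpathy splits with captions
    'bottom-up_features',      # unzipped bottom-up features go in here
    'final_dataset',           # preprocessing outputs will be moved here
]

for path in expected:
    print(path, 'OK' if os.path.exists(path) else 'MISSING')
```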
Next, type this command in a Python environment:
python bottom-up_features/tsv.py
This command will create the following files -
- An HDF5 file containing the bottom-up image features for the train and val splits, 36 per image for each split, stored as an I x 36 x 2048 tensor, where I is the number of images in the split.
- PKL files that map the training and validation image IDs to their indices in the HDF5 datasets created above.
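As a rough illustration of how these two outputs fit together, the sketch below looks up the region features of one image by its COCO ID. The file names and the HDF5 dataset key are assumptions for illustration; check what tsv.py actually writes.

```python
import pickle
import h5py

# File names and the 'image_features' key are illustrative assumptions;
# use the actual names produced by bottom-up_features/tsv.py.
with open('final_dataset/train36_imgid2idx.pkl', 'rb') as f:
    imgid2idx = pickle.load(f)        # COCO image ID -> row index in the HDF5 file

with h5py.File('final_dataset/train36.hdf5', 'r') as h5:
    features = h5['image_features']   # shape (I, 36, 2048)
    image_id = 391895                 # any COCO image ID in the training split
    img_feats = features[imgid2idx[image_id]]   # (36, 2048) region features
```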
Optionally, for the scene graphs, also run the following:
python create_input_files.py
This will create similar HDF5 and PKL files for the scene-graph features.
Move these files to the 'final_dataset' folder.
Next, type this command. If you don't want to prepare the scene-graph features, remove the -s flag:
python create_input_files.py -s
This command will create the following files -
- A JSON file for each split containing the order in which to load the bottom-up image features so that they are in lockstep with the captions loaded by the dataloader.
- A JSON file for each split with a list of N_c * I encoded captions, where N_c is the number of captions sampled per image. These captions are in the same order as the images in the HDF5 file. Therefore, the ith caption corresponds to the (i // N_c)th image.
- A JSON file for each split with a list of N_c * I caption lengths. The ith value is the length of the ith caption, which corresponds to the (i // N_c)th image.
- A JSON file which contains the word_map, the word-to-index dictionary.
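To make that indexing concrete, here is a minimal sketch of how the ith caption maps back to its image via i // N_c. The file names and the value of N_c are illustrative assumptions; use the names and settings actually written by create_input_files.py.

```python
import json

captions_per_image = 5  # N_c; an assumption, match the value used by create_input_files.py

# File names are illustrative; use the ones create_input_files.py writes.
with open('final_dataset/TRAIN_CAPTIONS.json') as f:
    captions = json.load(f)   # list of N_c * I encoded captions
with open('final_dataset/TRAIN_CAPLENS.json') as f:
    caplens = json.load(f)    # list of N_c * I caption lengths
with open('final_dataset/WORDMAP.json') as f:
    word_map = json.load(f)   # word -> index

i = 12                                   # pick any caption index
image_index = i // captions_per_image    # index of its image in the HDF5 file
rev_word_map = {v: k for k, v in word_map.items()}
words = [rev_word_map[t] for t in captions[i][:caplens[i]]]
print(image_index, ' '.join(words))
```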
Although we make use of the official COCO captioning evaluation scripts, the nlg_eval_master folder is kept for legacy reasons.
Next, go to the nlg_eval_master folder and type the following two commands:
pip install -e .
nlg-eval --setup
This will install all the files needed for evaluation.
To train the bottom-up top down model, type:
python train.py
To evaluate the model on the Karpathy test split, edit the eval.py file to include the model checkpoint location and then type:
python eval.py
Beam search is used to generate captions during evaluation. Beam search iteratively considers the set of the k best sentences up to time t as candidates for generating sentences of length t + 1, and keeps only the best k of the results. A beam size of 5 is used for inference.
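The sketch below illustrates that procedure in a generic form; it is not the exact decoding loop in eval.py, and the step function it assumes (returning next-word log-probabilities for a partial sequence) is hypothetical.

```python
import torch

def beam_search(step, start_token, end_token, beam_size=5, max_len=50):
    """Generic beam search sketch; `step(seq)` is a hypothetical callable that
    returns a 1-D tensor of log-probabilities over the vocabulary."""
    beams = [([start_token], 0.0)]        # (partial caption, cumulative log-prob)
    completed = []
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            log_probs = step(seq)                          # (vocab_size,)
            top_lp, top_ix = log_probs.topk(beam_size)
            for lp, ix in zip(top_lp.tolist(), top_ix.tolist()):
                candidates.append((seq + [ix], score + lp))
        # keep only the k best extended sentences
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for seq, score in candidates[:beam_size]:
            (completed if seq[-1] == end_token else beams).append((seq, score))
        if not beams:
            break
    completed.extend(beams)
    return max(completed, key=lambda c: c[1])[0]           # best-scoring caption

# Toy usage with a dummy model (vocabulary of 10 tokens, token 9 = <end>):
dummy_step = lambda seq: torch.log_softmax(torch.randn(10), dim=0)
print(beam_search(dummy_step, start_token=0, end_token=9))
```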
The metrics reported are those most often used for image captioning: BLEU-4, CIDEr, METEOR, and ROUGE-L. The official MSCOCO evaluation scripts are used to measure these scores.
Code adapted with thanks from https://github.com/poojahira/image-captioning-bottom-up-top-down