Create a folder called 'data'.
Unzip all files (captions and images) and place the extracted folders in the 'data' folder.
Set up the graph-rcnn.pytorch repository from https://github.com/jwyang/graph-rcnn.pytorch. Download the imp_relpn weights and place them in ./weights.
Next, run these commands in a Python environment:
cd bottom-up_features
python create_sg_h5.py
This will create the scene graph features.
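The core of building such a feature file is packing a variable number of per-image scene-graph features into fixed-size arrays. The sketch below is illustrative only and is not taken from create_sg_h5.py; the names (pad_relations, FEAT_DIM, MAX_RELS) and the padding scheme are assumptions.

```python
import numpy as np

# Hypothetical per-image scene-graph output: a variable number of
# detected relations, each with a fixed-size feature vector.
FEAT_DIM = 512
MAX_RELS = 36  # pad/truncate every image to this many relations

def pad_relations(rel_feats, max_rels=MAX_RELS, feat_dim=FEAT_DIM):
    """Pad (or truncate) an (n, feat_dim) array to (max_rels, feat_dim)."""
    rel_feats = np.asarray(rel_feats, dtype=np.float32)[:max_rels]
    out = np.zeros((max_rels, feat_dim), dtype=np.float32)
    out[:len(rel_feats)] = rel_feats
    return out, len(rel_feats)  # the true count doubles as a mask length

feats, n = pad_relations(np.ones((10, FEAT_DIM)))
```

Storing the true relation count alongside the padded array lets the caption model mask out the zero rows during attention.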
To train the model, type:
python train.py
To evaluate the model on the COCO evaluation dataset, edit the eval.py file to point to the model checkpoint location, then type:
python eval.py
Beam search is used to generate captions during evaluation. Beam search iteratively considers the set of the k best sentences up to time t as candidates to generate sentences of size t + 1, and keeps only the resulting best k of them. A beam size of five is used for inference.
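The beam search step described above can be sketched generically as follows; the step function here is a toy stand-in for the caption decoder's per-token log-probabilities, not the actual model.

```python
import math

def beam_search(step_fn, start, beam_size=5, max_len=4):
    """Generic beam search.

    step_fn(seq) returns a list of (token, log_prob) extensions for a
    sequence. At every length, only the beam_size highest-scoring
    sequences are kept as candidates for the next step.
    """
    beams = [([start], 0.0)]  # (sequence, cumulative log-probability)
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            for tok, logp in step_fn(seq):
                candidates.append((seq + [tok], score + logp))
        # keep only the k best extended sequences
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
    return beams

# Toy next-token distribution: always the same two weighted choices.
def toy_step(seq):
    return [("a", math.log(0.6)), ("b", math.log(0.4))]

best_seq, best_score = beam_search(toy_step, "<s>", beam_size=2, max_len=3)[0]
```

With a real decoder, step_fn would also emit an end-of-sentence token, and finished hypotheses would be set aside rather than extended further.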
The reported metrics are those most commonly used for image captioning: BLEU-4, CIDEr, METEOR, and ROUGE-L. The official MSCOCO evaluation scripts are used to compute these scores.
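To illustrate the n-gram matching that underlies BLEU, here is a minimal single-sentence, single-reference BLEU-4 with a brevity penalty. The official MSCOCO scripts compute a smoothed, corpus-level, multi-reference variant, so this sketch is illustrative only.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu4(candidate, reference):
    """Single-reference sentence-level BLEU-4 (illustrative sketch)."""
    precisions = []
    for n in range(1, 5):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        # clipped n-gram matches between candidate and reference
        overlap = sum(min(count, ref[g]) for g, count in cand.items())
        precisions.append(overlap / max(sum(cand.values()), 1))
    if min(precisions) == 0:
        return 0.0
    # brevity penalty discourages overly short candidates
    bp = min(1.0, math.exp(1 - len(reference) / len(candidate)))
    return bp * math.exp(sum(math.log(p) for p in precisions) / 4)

ref = "a man rides a horse on the beach".split()
score = bleu4(ref, ref)  # identical sentences score 1.0
```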
To evaluate the model on the nocaps dataset, set up the updown baseline repository, then type:
python eval_nocaps.py
python eval_parse_tmp.py
The first script generates the predictions; the second reformats and submits the results.
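The reformatting step can be sketched as converting raw per-image predictions into the COCO-style list-of-dicts JSON that caption evaluation servers expect. This is a sketch under assumptions: the function name and the {image_id: caption} input shape are hypothetical, and the actual submission is handled by eval_parse_tmp.py.

```python
import json

def to_submission_format(predictions):
    """Convert {image_id: caption} predictions into the list-of-dicts
    JSON shape used for COCO-style caption submissions."""
    return [{"image_id": int(i), "caption": c}
            for i, c in sorted(predictions.items())]

preds = {1: "a dog runs on grass", 2: "a plate of food"}
submission = to_submission_format(preds)
json_blob = json.dumps(submission)  # ready to write to a results file
```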
Code adapted with thanks from https://github.com/poojahira/image-captioning-bottom-up-top-down