This is the implementation of the paper Discriminability Objective for Training Descriptive Captions.
Python 2.7 (because there is no coco-caption version for Python 3)
PyTorch 1.0 (along with torchvision)
Java 1.8 (for coco-caption)
git clone --recursive https://github.com/ruotianluo/DiscCaptioning.git
In this paper we use the data split from Context-aware Captions from Context-agnostic Supervision. It is different from the standard Karpathy split, so different files need to be downloaded.
Download link: Google drive link
To train on your own, you only need to download dataset_coco.json, but it is also suggested to download cocotalk.json and cocotalk_label.h5. If you want to run the pretrained model, you have to download all three files.
cd coco-caption
bash ./get_stanford_models.sh
cd annotations
# Download captions_val2014.json from the google drive link above to this folder
cd ../../
We need to replace captions_val2014.json because the original file can only evaluate images from the val2014 set, while we are using Rama's split.
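As a quick sanity check that the replaced annotation file actually covers the images you will evaluate on, a minimal sketch is shown below. It assumes the standard COCO annotation fields ('images' with 'id') and the Karpathy-format fields 'cocoid' and 'split' in dataset_coco.json; those field names are assumptions, not guaranteed by this repo.

```python
# Sketch: confirm the replaced captions_val2014.json covers the test images
# in Rama's split. Field names ('id', 'cocoid', 'split') are assumed to follow
# the standard COCO / Karpathy-format conventions.
import json

anns = json.load(open('coco-caption/annotations/captions_val2014.json'))
split = json.load(open('data/dataset_coco.json'))

ann_ids = set(img['id'] for img in anns['images'])
test_ids = set(img['cocoid'] for img in split['images'] if img['split'] == 'test')

print('test images missing from annotations: %d' % len(test_ids - ann_ids))
```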
In this paper, the retrieval model uses the output of the last layer of ResNet-101, and the captioning model uses the bottom-up features from https://arxiv.org/abs/1707.07998.
The features can be downloaded from the same link, and you need to decompress them into data/cocotalk_fc and data/cocobu_att respectively.
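After decompressing, each folder is expected to hold one feature file per COCO image id. The sketch below shows how to inspect them; the per-image file names (<image_id>.npy / <image_id>.npz) and the 'feat' key follow the ImageCaptioning.pytorch convention and are an assumption here, as is the example image id.

```python
# Sketch: inspect one fc feature and one bottom-up attention feature.
# File naming (<image_id>.npy / <image_id>.npz) and the 'feat' key are
# assumed from the ImageCaptioning.pytorch data layout.
import numpy as np

image_id = '391895'  # hypothetical example; use any COCO image id in the split

fc_feat = np.load('data/cocotalk_fc/%s.npy' % image_id)
att_feat = np.load('data/cocobu_att/%s.npz' % image_id)['feat']

print('fc feature shape: %s' % (fc_feat.shape,))    # expected (2048,)
print('att feature shape: %s' % (att_feat.shape,))  # expected (num_boxes, 2048)
```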
Download the pretrained models from link, and decompress them into the root folder.
To evaluate a pretrained model, run:
bash eval.sh att_d1 test
The pretrained models should reproduce the results reported in the paper.
Preprocess the captions (skip if you already have 'cocotalk.json' and 'cocotalk_label.h5'):
$ python scripts/prepro_labels.py --input_json data/dataset_coco.json --output_json data/cocotalk.json --output_h5 data/cocotalk
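A quick way to sanity-check the preprocessed files is sketched below. The field names ('ix_to_word', 'images', 'labels', 'label_length') follow the ImageCaptioning.pytorch preprocessing format and are an assumption here.

```python
# Sketch: sanity-check the preprocessed vocabulary and encoded captions.
# Field/dataset names are assumed from the ImageCaptioning.pytorch format.
import json
import h5py

info = json.load(open('data/cocotalk.json'))
print('vocab size: %d' % len(info['ix_to_word']))
print('images: %d' % len(info['images']))

with h5py.File('data/cocotalk_label.h5', 'r') as f:
    print('encoded captions: %s' % (f['labels'].shape,))       # (num_captions, max_length)
    print('caption lengths: %s' % (f['label_length'].shape,))  # (num_captions,)
```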
Preprocess for self-critical training:
$ python scripts/prepro_ngrams.py --input_json data/dataset_coco.json --dict_json data/cocotalk.json --output_pkl data/coco-train --split train
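The resulting document-frequency statistics are what the CIDEr reward uses during self-critical training. A sketch of peeking into them follows; the '-idxs.p' suffix and the 'ref_len' / 'document_frequency' keys are assumed from the ImageCaptioning.pytorch version of this script.

```python
# Sketch: inspect the n-gram document frequencies produced for CIDEr.
# The file name suffix '-idxs.p' and the pickle keys are assumptions
# based on the ImageCaptioning.pytorch prepro_ngrams.py output.
from six.moves import cPickle

with open('data/coco-train-idxs.p', 'rb') as f:
    df = cPickle.load(f)

print('reference corpus size: %s' % df['ref_len'])
print('distinct n-grams: %d' % len(df['document_frequency']))
```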
First, train a retrieval model:
bash run_fc_con.sh
Second, pretrain the captioning model:
bash run_att.sh
Third, finetune the captioning model with cider+discriminability optimization:
bash run_att_d.sh 1
(1 is the discriminability weight and can be changed to other values.)
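If you want to compare several discriminability weights, a small convenience sketch is shown below. It only wraps the script interface documented above (run_att_d.sh <weight>); the weight values are arbitrary examples.

```python
# Sketch: sweep over a few discriminability weights by re-running the
# provided finetuning script. Weight values are arbitrary examples.
import subprocess

for weight in ['0.5', '1', '2']:
    print('finetuning with discriminability weight %s' % weight)
    subprocess.check_call(['bash', 'run_att_d.sh', weight])
```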
Finally, evaluate the finetuned model:
bash eval.sh att_d1 test
If you find this code useful, please consider citing:
@InProceedings{Luo_2018_CVPR,
author = {Luo, Ruotian and Price, Brian and Cohen, Scott and Shakhnarovich, Gregory},
title = {Discriminability Objective for Training Descriptive Captions},
booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
month = {June},
year = {2018}
}
The code is based on ImageCaptioning.pytorch.