This repository contains the PyTorch implementation of *Multimodal Transformer with Multi-View Visual Representation for Image Captioning*. Using the commonly used bottom-up-attention visual features, a single svbase model delivers 130.9 CIDEr on the Karpathy test split of the MSCOCO dataset. Please check our paper for details.
The annotation files can be downloaded here and unzipped into the `datasets` folder.
The bottom-up features can be extracted with our bottom-up-attention repo.
You can extract features using the following script:
```shell
# 1. Extract the bounding boxes of each image
$ python3 extract_features.py --mode caffe \
    --config-file configs/bua-caffe/extract-bua-caffe-r101-bbox-only.yaml \
    --image-dir <image_dir> --out-dir <bbox_dir> --resume

# 2. Extract the RoI features from the bounding boxes
$ python3 extract_features.py --mode caffe \
    --config-file configs/bua-caffe/extract-bua-caffe-r101-gt-bbox.yaml \
    --image-dir <image_dir> --gt-bbox-dir <bbox_dir> --out-dir <output_dir> --resume
```
You can compare your extracted features with the feature file we provide to verify that the extraction is correct.
Finally, the `datasets` folder will have the following structure:
```
|-- datasets
   |-- mscoco
   |  |-- features
   |  |  |-- frcn-r101
   |  |  |  |-- train2014
   |  |  |  |  |-- COCO_train2014_....npz
   |  |  |  |-- val2014
   |  |  |  |  |-- COCO_val2014_....npz
   |  |  |  |-- test2015
   |  |  |  |  |-- COCO_test2015_....npz
   |  |-- annotations
   |  |  |-- coco-train-idxs.p
   |  |  |-- coco-train-words.p
   |  |  |-- cocotalk_label.h5
   |  |  |-- cocotalk.json
   |  |  |-- vocab.json
   |  |  |-- glove_embeding.npy
```
The following script will train a model with cross-entropy loss:

```shell
$ python train.py --caption_model svbase --ckpt_path <checkpoint_dir> --gpu_id 0
```
- `caption_model` refers to the model to be trained, e.g., svbase, umv, or umv3.
- `ckpt_path` refers to the directory in which to save checkpoints.
- `gpu_id` refers to the GPU id.
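The cross-entropy stage maximizes the likelihood of the ground-truth caption token by token, with padded positions masked out. A minimal sketch of such a masked loss (the function name and tensor layout are illustrative, not the repo's actual API):

```python
import torch
import torch.nn.functional as F

def caption_xe_loss(logits, targets, mask):
    """Masked per-token cross-entropy over a padded caption batch.

    logits:  (B, T, V) decoder outputs over the vocabulary
    targets: (B, T) ground-truth token ids
    mask:    (B, T) 1.0 for real tokens, 0.0 for padding
    """
    per_token = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),  # (B*T, V)
        targets.reshape(-1),                  # (B*T,)
        reduction='none',
    )
    # Average only over the unpadded tokens.
    return (per_token * mask.reshape(-1)).sum() / mask.sum()
```

Averaging by `mask.sum()` rather than batch size keeps the loss scale independent of caption length, which is what most captioning codebases (including self-critical.pytorch) do.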
Based on the model trained with cross-entropy loss, the following script will load the pre-trained model and fine-tune it with self-critical loss:
```shell
$ python train.py --caption_model svbase --learning_rate 1e-5 --ckpt_path <checkpoint_dir> --start_from <checkpoint_dir_rl> --gpu_id 0 --max_epochs 25
```
- `caption_model` refers to the model to be trained.
- `learning_rate` refers to the learning rate used in self-critical training.
- `ckpt_path` refers to the directory in which to save checkpoints.
- `start_from` refers to the directory of the pre-trained cross-entropy checkpoint to load.
- `gpu_id` refers to the GPU id.
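In self-critical sequence training (SCST), the reward for a sampled caption is its CIDEr score baselined by the score of the greedily decoded caption, so only samples that beat the greedy baseline are reinforced. A minimal sketch of the loss, with the CIDEr scoring itself abstracted away (names are illustrative, not the repo's API):

```python
import torch

def scst_loss(sample_logprobs, sample_reward, greedy_reward, mask):
    """Self-critical loss: advantage = r(sampled) - r(greedy).

    sample_logprobs: (B, T) log-probs of the sampled caption tokens
    sample_reward:   (B,) CIDEr scores of the sampled captions
    greedy_reward:   (B,) CIDEr scores of the greedy captions (baseline)
    mask:            (B, T) 1.0 for real tokens, 0.0 for padding
    """
    advantage = (sample_reward - greedy_reward).unsqueeze(1)  # (B, 1)
    # REINFORCE with the greedy score as baseline, averaged over tokens.
    loss = -sample_logprobs * advantage * mask
    return loss.sum() / mask.sum()
```

When the sampled caption scores no better than the greedy one, the advantage is zero or negative and the gradient pushes probability mass away from it; this is the mechanism behind the CIDEr gains reported after fine-tuning.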
Given the trained model, the following script will test its performance on the val split of MSCOCO:

```shell
$ python test.py --ckpt_path <checkpoint_dir> --gpu_id 0
```
- `ckpt_path` refers to the directory in which the checkpoint is saved.
- `gpu_id` refers to the GPU id.
We provide pre-trained models here.
| Model | Backbone | BLEU@1 | CIDEr | METEOR | Download |
|---|---|---|---|---|---|
| SV | ResNet-101 | 80.8 | 130.9 | 29.1 | model |
```
@article{yu2019multimodal,
  title={Multimodal transformer with multi-view visual representation for image captioning},
  author={Yu, Jun and Li, Jing and Yu, Zhou and Huang, Qingming},
  journal={IEEE Transactions on Circuits and Systems for Video Technology},
  year={2019},
  publisher={IEEE}
}
```
We thank Ruotian Luo for his self-critical.pytorch, cider and coco-caption repos.