This repository contains the PyTorch implementation of the paper Multimodal Transformer with Multi-View Visual Representation for Image Captioning. Using the bottom-up-attention visual features (with slight improvements), our single-view Multimodal Transformer model (MT_sv) delivers 130.9 CIDEr on the Karpathy test split of the MSCOCO dataset. Please check our paper for details.
The annotation files can be downloaded here and unzipped to the datasets folder.
The visual features are extracted by our bottom-up-attention.pytorch repo using the following scripts:
# 1. Extract the bounding boxes from the images
$ python3 extract_features.py --mode caffe \
--config-file configs/bua-caffe/extract-bua-caffe-r101-bbox-only.yaml \
--image-dir <image_dir> --out-dir <bbox_dir> --resume
# 2. Extract the RoI features using the bounding boxes
$ python3 extract_features.py --mode caffe \
--config-file configs/bua-caffe/extract-bua-caffe-r101-gt-bbox.yaml \
--image-dir <image_dir> --gt-bbox-dir <bbox_dir> --out-dir <output_dir> --resume
We provide pre-extracted features in the `datasets/mscoco/features/val2014` folder for the image in `datasets/mscoco/image` to help validate the correctness of your extracted features.
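To compare your own extraction against the provided features, you can open one of the `.npz` files and inspect the stored arrays. The following is a minimal sketch using NumPy; the array names inside each file depend on the extraction script, so it simply lists whatever is stored:

```python
import glob
import numpy as np

# Pick any of the provided .npz files under datasets/mscoco/features/val2014.
feat_files = sorted(glob.glob("datasets/mscoco/features/val2014/*.npz"))
feats = np.load(feat_files[0])

# List every stored array with its shape and dtype; compare these against
# the file you extracted yourself for the same image.
for key in feats.files:
    arr = feats[key]
    print(key, arr.shape, arr.dtype)
```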
We use ResNet-101 as the backbone and extract features for the MSCOCO dataset into the `datasets/mscoco/features/frcn-r101` folder.
Finally, the `datasets` folder will have the following structure (a quick layout check follows the tree):
|-- datasets
   |-- mscoco
   |  |-- features
   |  |  |-- frcn-r101
   |  |  |  |-- train2014
   |  |  |  |  |-- COCO_train2014_....jpg.npz
   |  |  |  |-- val2014
   |  |  |  |  |-- COCO_val2014_....jpg.npz
   |  |  |  |-- test2015
   |  |  |  |  |-- COCO_test2015_....jpg.npz
   |  |-- annotations
   |  |  |-- coco-train-idxs.p
   |  |  |-- coco-train-words.p
   |  |  |-- cocotalk_label.h5
   |  |  |-- cocotalk.json
   |  |  |-- vocab.json
   |  |  |-- glove_embeding.npy
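Before training, it may help to verify that everything landed in the expected place. The script below is an optional sketch written against the tree above; the paths are taken from that listing, so adjust them if your layout differs:

```python
import os

# Paths taken from the directory tree above; adjust if your layout differs.
required = [
    "datasets/mscoco/features/frcn-r101/train2014",
    "datasets/mscoco/features/frcn-r101/val2014",
    "datasets/mscoco/features/frcn-r101/test2015",
    "datasets/mscoco/annotations/coco-train-idxs.p",
    "datasets/mscoco/annotations/coco-train-words.p",
    "datasets/mscoco/annotations/cocotalk_label.h5",
    "datasets/mscoco/annotations/cocotalk.json",
    "datasets/mscoco/annotations/vocab.json",
    "datasets/mscoco/annotations/glove_embeding.npy",
]

for path in required:
    status = "ok" if os.path.exists(path) else "MISSING"
    print(f"[{status}] {path}")
```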
The following script will train a model with cross-entropy loss (a sketch of this loss is given after the option list):
$ python train.py --caption_model svbase --ckpt_path <checkpoint_dir> --gpu_id 0
- `caption_model` refers to the model to be trained, such as `svbase` and `umv`.
- `ckpt_path` refers to the directory to save checkpoints.
- `gpu_id` refers to the GPU id.
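For reference, the cross-entropy objective at this stage is the standard token-level loss over the ground-truth captions with padded positions masked out. The snippet below is an illustrative sketch, not the exact implementation in train.py:

```python
import torch
import torch.nn.functional as F

def masked_cross_entropy(logits, targets, mask):
    """Token-level cross-entropy for captioning, ignoring padded positions.

    logits:  (batch, seq_len, vocab_size) decoder scores
    targets: (batch, seq_len) ground-truth word indices
    mask:    (batch, seq_len) 1.0 for real tokens, 0.0 for padding
    """
    vocab_size = logits.size(-1)
    loss = F.cross_entropy(
        logits.reshape(-1, vocab_size),   # (batch * seq_len, vocab_size)
        targets.reshape(-1),              # (batch * seq_len,)
        reduction="none",
    )
    loss = loss * mask.reshape(-1)        # zero out padded positions
    return loss.sum() / mask.sum().clamp(min=1.0)
```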
Based on the model trained with cross-entropy loss, the following script will load the pre-trained model and fine-tune it with the self-critical loss (see the sketch after the option list):
$ python train.py --caption_model svbase --learning_rate 1e-5 --ckpt_path <checkpoint_dir> --start_from <checkpoint_dir_rl> --gpu_id 0 --max_epochs 25
- `caption_model` refers to the model to be trained.
- `learning_rate` refers to the learning rate used in self-critical training.
- `ckpt_path` refers to the directory to save checkpoints.
- `gpu_id` refers to the GPU id.
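Self-critical training follows the standard SCST recipe used by self-critical.pytorch: sample a caption, decode one greedily, and use the greedy caption's reward (e.g. CIDEr) as the baseline for the sampled one. The sketch below only shows the shape of that loss under this assumption; the actual sampling and reward computation live in the training code:

```python
def self_critical_loss(sample_logprobs, sample_mask, sample_reward, greedy_reward):
    """REINFORCE with the greedy caption's score as baseline (SCST sketch).

    sample_logprobs: (batch, seq_len) log-probs of the sampled caption tokens
    sample_mask:     (batch, seq_len) 1.0 for real tokens, 0.0 for padding
    sample_reward:   (batch,) e.g. CIDEr of the sampled captions
    greedy_reward:   (batch,) e.g. CIDEr of the greedy-decoded captions
    """
    advantage = (sample_reward - greedy_reward).unsqueeze(1)   # (batch, 1)
    loss = -advantage * sample_logprobs * sample_mask          # reward-weighted log-likelihood
    return loss.sum() / sample_mask.sum().clamp(min=1.0)
```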
Given the trained model, the following script will report the performance on the `val` split of MSCOCO (a standalone metric example follows the option list):
$ python test.py --ckpt_path <checkpoint_dir> --gpu_id 0
- `ckpt_path` refers to the directory of the saved checkpoint.
- `gpu_id` refers to the GPU id.
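If you want to score generated captions outside of test.py, the CIDEr metric from the coco-caption code can be called directly; the example below assumes the pycocoevalcap packaging of that code is installed. The captions are made-up placeholders, and note that CIDEr is a corpus-level metric, so a meaningful score requires the full evaluation set rather than a single image:

```python
from pycocoevalcap.cider.cider import Cider

# Made-up example captions keyed by image id; real evaluation uses the COCO
# ground-truth annotations and the captions produced by test.py. Each
# hypothesis entry must be a list containing exactly one caption string.
refs = {"184613": ["a man riding a bike down a dirt road",
                   "a person on a bicycle on a country path"]}
hyps = {"184613": ["a man rides a bike on a dirt road"]}

scorer = Cider()
score, per_image = scorer.compute_score(refs, hyps)
print("CIDEr:", score)
```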
At present we provide the pre-trained model for the single-view MT model (MT_sv). More models will be added in the future.
| Model | Backbone | BLEU@1 | METEOR | CIDEr | Download |
|-------|----------|--------|--------|-------|----------|
| MT_sv | ResNet-101 | 80.8 | 29.1 | 130.9 | model |
If this repository is helpful for your research, we'd really appreciate it if you could cite the following paper:
@article{yu2019multimodal,
title={Multimodal transformer with multi-view visual representation for image captioning},
author={Yu, Jun and Li, Jing and Yu, Zhou and Huang, Qingming},
journal={IEEE Transactions on Circuits and Systems for Video Technology},
year={2019},
publisher={IEEE}
}
We thank Ruotian Luo for his self-critical.pytorch, cider and coco-caption repos.