This repository contains the PyTorch implementation of *Multimodal Transformer with Multi-View Visual Representation for Image Captioning*. Using the commonly used bottom-up-attention visual features, a single svbase model delivers 130.9 CIDEr on the Karpathy test split of the MSCOCO dataset. Please check our paper for details.
The annotation files can be downloaded here and unzipped into the `datasets` folder.
The bottom-up features can be extracted with our bottom-up-attention repo.
You can extract features using the following script:
```shell
# 1. Extract the bounding boxes of each image
$ python3 extract_features.py --mode caffe \
    --config-file configs/bua-caffe/extract-bua-caffe-r101-bbox-only.yaml \
    --image-dir <image_dir> --out-dir <bbox_dir> --resume

# 2. Extract the RoI features from the bounding boxes
$ python3 extract_features.py --mode caffe \
    --config-file configs/bua-caffe/extract-bua-caffe-r101-gt-bbox.yaml \
    --image-dir <image_dir> --gt-bbox-dir <bbox_dir> --out-dir <output_dir> --resume
```
You can compare your extracted features with the feature file we provide to verify that the extraction is correct.
Finally, the `datasets` folder will have the following structure:
```
|-- datasets
   |-- mscoco
   |  |-- features
   |  |  |-- frcn-r101
   |  |  |  |-- train2014
   |  |  |  |  |-- COCO_train2014_....npz
   |  |  |  |-- val2014
   |  |  |  |  |-- COCO_val2014_....npz
   |  |  |  |-- test2015
   |  |  |  |  |-- COCO_test2015_....npz
   |  |-- annotations
   |  |  |-- coco-train-idxs.p
   |  |  |-- coco-train-words.p
   |  |  |-- cocotalk_label.h5
   |  |  |-- cocotalk.json
   |  |  |-- vocab.json
   |  |  |-- glove_embeding.npy
```
The following script will train a model with cross-entropy loss:

```shell
$ python train.py --caption_model svbase --ckpt_path <checkpoint_dir> --gpu_id 0
```
- `caption_model` refers to the model to be trained, e.g., svbase, umv, or umv3.
- `ckpt_path` refers to the directory in which to save checkpoints.
- `gpu_id` refers to the GPU id.
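The cross-entropy stage maximizes the likelihood of the ground-truth caption token by token, with padded positions masked out. A minimal sketch of such a masked loss (the function name and tensor layout are illustrative, not the repo's actual API):

```python
import torch
import torch.nn.functional as F

def caption_xe_loss(logits, targets, mask):
    """Masked per-token cross-entropy over a padded caption batch.

    logits:  (B, T, V) decoder outputs over the vocabulary
    targets: (B, T) ground-truth token ids
    mask:    (B, T) 1.0 for real tokens, 0.0 for padding
    """
    per_token = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),  # (B*T, V)
        targets.reshape(-1),                  # (B*T,)
        reduction='none',
    )
    # Average only over the unpadded tokens.
    return (per_token * mask.reshape(-1)).sum() / mask.sum()
```

Averaging by `mask.sum()` rather than batch size keeps the loss scale independent of caption length, which is what most captioning codebases (including self-critical.pytorch) do.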
Based on the model trained with cross-entropy loss, the following script will load the pre-trained model and fine-tune it with self-critical loss:
```shell
$ python train.py --caption_model svbase --learning_rate 1e-5 --ckpt_path <checkpoint_dir> --start_from <checkpoint_dir_rl> --gpu_id 0 --max_epochs 25
```
- `caption_model` refers to the model to be trained.
- `learning_rate` refers to the learning rate used in self-critical training.
- `ckpt_path` refers to the directory in which to save checkpoints.
- `start_from` refers to the directory of the pre-trained cross-entropy checkpoint to load.
- `gpu_id` refers to the GPU id.
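In self-critical sequence training (SCST), the reward for a sampled caption is its CIDEr score baselined by the score of the greedily decoded caption, so only samples that beat the greedy baseline are reinforced. A minimal sketch of the loss, with the CIDEr scoring itself abstracted away (names are illustrative, not the repo's API):

```python
import torch

def scst_loss(sample_logprobs, sample_reward, greedy_reward, mask):
    """Self-critical loss: advantage = r(sampled) - r(greedy).

    sample_logprobs: (B, T) log-probs of the sampled caption tokens
    sample_reward:   (B,) CIDEr scores of the sampled captions
    greedy_reward:   (B,) CIDEr scores of the greedy captions (baseline)
    mask:            (B, T) 1.0 for real tokens, 0.0 for padding
    """
    advantage = (sample_reward - greedy_reward).unsqueeze(1)  # (B, 1)
    # REINFORCE with the greedy score as baseline, averaged over tokens.
    loss = -sample_logprobs * advantage * mask
    return loss.sum() / mask.sum()
```

When the sampled caption scores no better than the greedy one, the advantage is zero or negative and the gradient pushes probability mass away from it; this is the mechanism behind the CIDEr gains reported after fine-tuning.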
Given the trained model, the following script will test its performance on the val split of MSCOCO:

```shell
$ python test.py --ckpt_path <checkpoint_dir> --gpu_id 0
```
- `ckpt_path` refers to the directory in which the checkpoint is saved.
- `gpu_id` refers to the GPU id.
We provide pre-trained models here.
| Model | Backbone | BLEU@1 | CIDEr | METEOR | Download |
|---|---|---|---|---|---|
| SV | ResNet-101 | 80.8 | 130.9 | 29.1 | model |
```
@article{yu2019multimodal,
  title={Multimodal transformer with multi-view visual representation for image captioning},
  author={Yu, Jun and Li, Jing and Yu, Zhou and Huang, Qingming},
  journal={IEEE Transactions on Circuits and Systems for Video Technology},
  year={2019},
  publisher={IEEE}
}
```
We thank Ruotian Luo for his self-critical.pytorch, cider and coco-caption repos.