Official codebase for the paper "Human-like Controllable Image Captioning with Verb-specific Semantic Roles" (CVPR 2021).
The following dependencies should be sufficient. See vsr.yml for the full environment settings.
```
h5py 2.10.0
python 3.6.10
pytorch 1.5.1
munkres 1.0.12
numpy 1.18.5
speaksee 0.0.1
tensorboardx 2.0
torchvision 0.6.1
tqdm 4.46.0
```
Install the semantic role labeling tool with `pip install allennlp==1.0.0 allennlp-models==1.0.0`.
The model we used is provided as bert-base-srl-2020.03.24.tar.gz. The latest version of this tool can be found here.
You can follow the demo in AllenNLP_demo. We will also release a demo showing how to process data with this semantic role labeling tool; a minimal usage sketch is given below.
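As a rough illustration (not the released demo), the downloaded SRL archive can be queried through AllenNLP's Predictor API. The example sentence is made up, and the exact import that registers the SRL model may differ across allennlp-models versions:

```python
from allennlp.predictors.predictor import Predictor
# Importing the models package registers the SRL model class with AllenNLP.
# In allennlp-models 1.0.0 the SRL components live under structured_prediction;
# other releases may place them elsewhere.
import allennlp_models.structured_prediction  # noqa: F401

# Path to the archive mentioned above (adjust to where you downloaded it).
predictor = Predictor.from_path("bert-base-srl-2020.03.24.tar.gz")

result = predictor.predict(sentence="A man is riding a horse on the beach.")
# Each entry in result["verbs"] holds one verb and its labeled argument spans.
for frame in result["verbs"]:
    print(frame["verb"], frame["description"])
```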
The models can be downloaded from here and extracted into "./saved_model". Other data can be downloaded from link1 and link2 and extracted into "./datasets" and "./saved_data", respectively (using `tar -xzvf *.tgz`).
The preprocessing code used to generate these data will be released soon.
Flickr30k Entities: The detection features can be downloaded from here and extracted to './datasets/flickr/flickr30k_detections.hdf5'.
COCO Entities: The detection features can be downloaded from here and extracted to './datasets/coco/coco_detections.hdf5'. Refer to Show, Control and Tell for more information about COCO Entities.
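If you want to inspect these HDF5 files directly, a minimal sketch with h5py is shown below. The per-image key names follow the Show, Control and Tell detection files and are an assumption here; check the dataset loading code for the exact layout.

```python
import h5py

# Open the COCO detection-feature file mentioned above.
with h5py.File("./datasets/coco/coco_detections.hdf5", "r") as f:
    image_id = 391895  # hypothetical COCO image id
    # Assumed key layout: "<image_id>_features" and "<image_id>_boxes".
    features = f["%d_features" % image_id][()]  # (num_boxes, feature_dim) region features
    boxes = f["%d_boxes" % image_id][()]        # (num_boxes, 4) bounding boxes
    print(features.shape, boxes.shape)
```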
Train the GSRL model on two datasets, Flickr30k Entities and COCO Entities. We will release this part of the training code later. The S-level and R-level SSP models can be trained with the commands below.
S-level SSP
```
# train S-level SSP on Flickr30k Entities
python flickr_scripts/train_region_sort_flickr.py --checkpoint_path saved_model/flickr_s_ssp

# train S-level SSP on COCO Entities
python coco_scripts/train_region_sort.py --checkpoint_path saved_model/coco_s_ssp
```
R-level SSP
```
# train sinkhorn model on Flickr30k Entities
python flickr_scripts/train_sinkhorn_flickr.py --checkpoint_path saved_model/flickr_sinkhorn

# train sinkhorn model on COCO Entities
python coco_scripts/train_sinkhorn.py --checkpoint_path saved_model/coco_sinkhorn
```
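For background, the "sinkhorn" scripts presumably take their name from Sinkhorn normalization, which turns a role-region score matrix into a soft permutation. The snippet below is a generic sketch of that operation under that assumption, not the code in the train_sinkhorn scripts:

```python
import torch

def sinkhorn(scores, n_iters=20, eps=1e-8):
    """Generic Sinkhorn normalization of an (n, n) score matrix.

    Alternating row/column normalization of exp(scores) yields an
    approximately doubly-stochastic matrix, i.e. a soft permutation.
    """
    p = torch.exp(scores)
    for _ in range(n_iters):
        p = p / (p.sum(dim=1, keepdim=True) + eps)  # normalize rows
        p = p / (p.sum(dim=0, keepdim=True) + eps)  # normalize columns
    return p
```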
Firstly, train the captioning model with cross-entropy (XE) loss:

```
python coco_scripts/train.py --exp_name captioning_model --batch_size 100 --lr 5e-4
```

Next, further train it with reinforcement learning (RL) using a CIDEr reward:

```
python coco_scripts/train.py --exp_name captioning_model --batch_size 100 --lr 5e-5 --sample_rl
```
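As background, RL fine-tuning with a CIDEr reward is commonly implemented as self-critical sequence training (SCST). The sketch below shows the generic policy-gradient loss under that assumption; it is not the exact implementation in coco_scripts/train.py.

```python
import torch

def scst_loss(sample_logprobs, sample_reward, baseline_reward):
    """Generic self-critical policy-gradient loss.

    sample_logprobs: (batch,) summed log-probs of the sampled captions
    sample_reward:   (batch,) CIDEr score of each sampled caption
    baseline_reward: (batch,) CIDEr score of the greedy-decoded caption (baseline)
    """
    advantage = (sample_reward - baseline_reward).detach()  # reward relative to baseline
    return -(advantage * sample_logprobs).mean()
```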
Two arguments control which regions and verbs are used at evaluation time:

Argument | Description |
---|---|
--det | use detected regions instead of ground-truth regions |
--gt | use ground-truth verbs instead of predicted verbs |
Evaluate on Flickr30k Entities:

```
python flickr_scripts/eval_flickr.py
python flickr_scripts/eval_flickr.py --gt
python flickr_scripts/eval_flickr.py --det
python flickr_scripts/eval_flickr.py --gt --det
```
Evaluate on COCO Entities:

```
python coco_scripts/eval_coco.py
python coco_scripts/eval_coco.py --gt
python coco_scripts/eval_coco.py --det
python coco_scripts/eval_coco.py --gt --det
```
If you have any questions about our work, please open an issue or email us at zju_jiangzhihong@zju.edu.cn.
Please cite our work with the following BibTeX:
```
@inproceedings{chen2021vsr,
  title={Human-like Controllable Image Captioning with Verb-specific Semantic Roles},
  author={Chen, Long and Jiang, Zhihong and Xiao, Jun and Liu, Wei},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year={2021}
}
```