This repo holds the code and models for the end-to-end video captioning method presented at WACV 2019:
End-to-End Video Captioning with Multitask Reinforcement Learning
Lijun Li, Boqing Gong
If you use our code, please cite our paper.
- This code requires TensorFlow 1.1.0. The evaluation code is in Python, and you need to install the coco-caption evaluation toolkit if you want to evaluate the model.
Use the following to clone the repo to your local machine:
git clone https://github.com/adwardlee/multitask-end-to-end-video-captioning.git
We support experimenting with two publicly available datasets for video captioning: MSVD & MSR-VTT.
First extract frames from each video with cpu_extract.py. Then use read_certrain_number_frame.py to uniformly sample 5 frames from all frames of a video. Finally, use tf_feature_extract.py (after modifying the model path) to extract the Inception-ResNet-v2 features of each sampled frame. A minimal sketch of the sampling step is shown below.
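The uniform sampling amounts to picking evenly spaced frame indices per video. Below is a minimal sketch of that idea using OpenCV; it is not the repo's exact script, and the output format of cpu_extract.py / read_certrain_number_frame.py may differ.

```python
# Minimal sketch of uniformly sampling 5 frames from a video (assumed, not the repo's exact code).
import cv2
import numpy as np

def sample_frames(video_path, num_frames=5):
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    # Evenly spaced indices over the whole video.
    indices = np.linspace(0, total - 1, num_frames).astype(int)
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame = cap.read()
        if ok:
            # Resize to the Inception-ResNet-v2 input resolution before feature extraction.
            frames.append(cv2.resize(frame, (299, 299)))
    cap.release()
    return np.stack(frames)
```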
Use the *_s2vt.py scripts for training. Before that, change the model path of the evaluation function and some global parameters in the file. For example:
python tf_s2vt.py --gpu 0 --task train
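Step 1 trains the S2VT captioner with a standard cross-entropy objective. The sketch below shows the kind of masked word-level loss involved, in TF 1.x style; the tensor names (logits, caption_ids, caption_mask) are hypothetical placeholders, not the actual variables in tf_s2vt.py.

```python
# Minimal sketch of a masked cross-entropy caption loss (hypothetical names, TF 1.x style).
import tensorflow as tf

def masked_xent_loss(logits, caption_ids, caption_mask):
    # logits: [batch, steps, vocab]; caption_ids, caption_mask: [batch, steps]
    xent = tf.nn.sparse_softmax_cross_entropy_with_logits(
        labels=caption_ids, logits=logits)
    # Average only over real (non-padded) words.
    return tf.reduce_sum(xent * caption_mask) / tf.reduce_sum(caption_mask)
```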
Using the pretrained model from step 1, run:
python reinforcement_multisampling_tf_s2vt.py --task train
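Step 2 fine-tunes the captioner with REINFORCE: captions are sampled from the model and their log-probabilities are weighted by a sentence-level reward such as CIDEr. The sketch below illustrates that policy-gradient loss; the names (sample_logprobs, reward, baseline) are hypothetical, and the actual sampling and reward computation live in reinforcement_multisampling_tf_s2vt.py.

```python
# Minimal sketch of a REINFORCE-style caption loss (hypothetical names, TF 1.x style).
import tensorflow as tf

def reinforce_loss(sample_logprobs, sample_mask, reward, baseline):
    # sample_logprobs, sample_mask: [batch, steps]; reward, baseline: [batch]
    advantage = tf.expand_dims(reward - baseline, 1)             # [batch, 1]
    # Maximizing expected reward == minimizing the negative advantage-weighted log-likelihood.
    weighted = -advantage * sample_logprobs * sample_mask
    return tf.reduce_sum(weighted) / tf.reduce_sum(sample_mask)
```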
Using the pretrained model from step 2, run:
python reinforce_multitask_e2e_attribute_s2vt.py --task train
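Step 3 adds the auxiliary attribute-prediction task and trains the whole pipeline end to end. The sketch below shows one way the objectives could be combined; the attribute head, its labels, and the weight lambda_attr are hypothetical, not the exact formulation in reinforce_multitask_e2e_attribute_s2vt.py.

```python
# Minimal sketch of combining the captioning loss with an attribute loss (hypothetical names).
import tensorflow as tf

def multitask_loss(caption_loss, attribute_logits, attribute_labels, lambda_attr=0.1):
    # Multi-label attribute prediction with a sigmoid cross-entropy head.
    attr_loss = tf.reduce_mean(
        tf.nn.sigmoid_cross_entropy_with_logits(
            labels=attribute_labels, logits=attribute_logits))
    # Total objective: captioning term (XE or REINFORCE) plus weighted attribute term.
    return caption_loss + lambda_attr * attr_loss
```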
To evaluate, use the *_s2vt.py scripts. Before that, change the model path of the evaluation function and some global parameters in the file. For example:
python tf_s2vt.py --gpu 0 --task evaluate
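Evaluation relies on the coco-caption toolkit mentioned above. The sketch below scores generated captions against references with its standard API; the file names are placeholders, and the repo's evaluation function may wrap this differently.

```python
# Minimal sketch of scoring captions with the coco-caption toolkit (placeholder file names).
from pycocotools.coco import COCO
from pycocoevalcap.eval import COCOEvalCap

coco = COCO('references.json')             # ground-truth captions in COCO format
coco_res = coco.loadRes('generated.json')  # model outputs
evaluator = COCOEvalCap(coco, coco_res)
evaluator.params['image_id'] = coco_res.getImgIds()  # score only the generated videos
evaluator.evaluate()
for metric, score in evaluator.eval.items():
    print(metric, score)  # BLEU, METEOR, ROUGE_L, CIDEr
```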
For testing the pretrained models, please refer to the Repo.
The MSVD models can be downloaded from here. The MSR-VTT models can be downloaded from here.
We also apply temporal attention in TensorFlow; a sketch of the idea follows.
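Below is a minimal sketch of soft temporal attention over the per-frame features; the tensor names and sizes are hypothetical, and the repo's implementation may differ.

```python
# Minimal sketch of additive (soft) temporal attention over frame features (TF 1.x style).
# frame_feats: [batch, num_frames, feat_dim], decoder_state: [batch, hidden] -- hypothetical names.
import tensorflow as tf

def temporal_attention(frame_feats, decoder_state, attn_size=256):
    # Project frame features and the decoder state into a shared attention space.
    feat_proj = tf.layers.dense(frame_feats, attn_size, use_bias=False)        # [B, T, A]
    state_proj = tf.expand_dims(tf.layers.dense(decoder_state, attn_size), 1)  # [B, 1, A]
    # One score per frame, normalized over time.
    scores = tf.layers.dense(tf.tanh(feat_proj + state_proj), 1)               # [B, T, 1]
    weights = tf.nn.softmax(scores, dim=1)
    # Context vector: attention-weighted sum of the frame features.
    return tf.reduce_sum(weights * frame_feats, axis=1)                        # [B, feat_dim]
```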