Semantics-Assisted Video Captioning Model Trained with Scheduled Sampling Strategy
Description
This repo contains the code of the Semantics-Assisted Video Captioning Model, described in the paper "Semantics-Assisted Video Captioning Model Trained with Scheduled Sampling Strategy", which is under review at Frontiers in Robotics and AI.
We propose three ways to improve the video captioning model. First, we utilize both spatial features and dynamic spatio-temporal features as inputs to the semantic detection network in order to generate meaningful semantic features for videos. Second, we propose a scheduled sampling strategy which gradually transfers the training phase from a teacher-guiding manner toward a more self-teaching manner. Finally, the ordinary logarithm-probability loss function is weighted by sentence length, so that the inclination toward short sentences is alleviated. Our model achieves state-of-the-art results on the Youtube2Text dataset and is competitive with the state-of-the-art models on the MSR-VTT dataset.
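As a rough illustration of the last two ideas, here is a minimal NumPy sketch of scheduled sampling and of a length-weighted loss. It is a simplification under our own naming, not the model's actual implementation; the exact annealing schedule and loss weighting follow the paper.

```python
import numpy as np

def next_decoder_input(gt_token, model_token, sample_prob):
    """Scheduled sampling (sketch): with probability sample_prob, feed the
    model's own prediction back into the decoder instead of the ground-truth
    token. Annealing sample_prob upward over training moves the decoder from
    a teacher-guiding manner toward a self-teaching manner."""
    return model_token if np.random.rand() < sample_prob else gt_token

def length_weighted_nll(token_log_probs):
    """The plain sum of negative token log-probabilities favors short
    captions; dividing by caption length removes that bias (a simplified
    form of the sentence-length weighting)."""
    return -np.sum(token_log_probs) / len(token_log_probs)
```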
The overall structure of our model, along with some captions generated by it, is shown in the images in this repo.
If you need a newer and more powerful model, please refer to Delving-Deeper-into-the-Decoder-for-Video-Captioning.
Dependencies
- Python 3.6
- TensorFlow 1.13
- NumPy
- scikit-learn
- pycocoevalcap (Python 3)
Manual
- Make sure you have installed all the required packages.
- Download pycocoevalcap and put it alongside the `msrvtt`, `msvd`, and `tagging` folders.
- Download the files listed in the Data section.
- Run `cd path_to_directory_of_model; mkdir saves`.
- `run_model.sh` is used for training models and `test_model.sh` is used for testing models. Specify the GPU you want to use by modifying the `CUDA_VISIBLE_DEVICES` value. Specify the needed data paths by modifying the `corpus`, `ecores`, `tag`, and `ref` values. Words are sampled by the argmax strategy if `argmax` is 1 and by the multinomial strategy if `argmax` is 0 (see the sketch after this list). `name` is the name you give to the model. `test` is the path of the saved model to be tested; do not give `test` a value if you want to train a model.
- After configuring the bash file, run `bash run_model.sh` for training or `bash test_model.sh` for testing.
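As a reference for the `argmax` flag, here is a small NumPy sketch of the two sampling strategies. This is not the repo's actual decoding code, and the function name is ours:

```python
import numpy as np

def sample_word(logits, use_argmax):
    """Pick the next word id from a vocabulary logits vector."""
    probs = np.exp(logits - np.max(logits))  # numerically stable softmax
    probs /= probs.sum()
    if use_argmax:
        # argmax=1: greedy decoding, always take the most probable word
        return int(np.argmax(probs))
    # argmax=0: multinomial sampling, draw a word from the distribution
    return int(np.random.choice(len(probs), p=probs))
```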
Results
Comparison on Youtube2Text
Model | BLEU-4 | CIDEr | METEOR | ROUGE-L | Overall
---|---|---|---|---|---
LSTM-E | 45.3 | | 31.0 | |
h-RNN | 49.9 | 65.8 | 32.6 | |
aLSTMs | 50.8 | 74.8 | 33.3 | |
SCN | 51.1 | 77.7 | 33.5 | |
MTVC | 54.5 | 92.4 | 36.0 | 72.8 | 0.9198
ECO | 53.5 | 85.8 | 35.0 | |
SibNet | 54.2 | 88.2 | 34.8 | 71.7 | 0.8969
Our Model | 61.8 | 103.0 | 37.8 | 76.8 | 1.0000
Comparison on MSR-VTT
Model | BLEU-4 | CIDEr | METEOR | ROUGE-L | Overall
---|---|---|---|---|---
v2t_navigator | 40.8 | 44.8 | 28.2 | 60.9 | 0.9325 |
Aalto | 39.8 | 45.7 | 26.9 | 59.8 | 0.9157 |
VideoLAB | 39.1 | 44.1 | 27.7 | 60.6 | 0.9140 |
MTVC | 40.8 | 47.1 | 28.8 | 60.2 | 0.9459 |
CIDEnt-RL | 40.5 | 51.7 | 28.4 | 61.4 | 0.9678 |
SibNet | 40.9 | 47.5 | 27.5 | 60.2 | 0.9374 |
HACA | 43.4 | 49.7 | 29.5 | 61.8 | 0.9856 |
TAMoE | 42.2 | 48.9 | 29.4 | 62.0 | 0.9749 |
Our Model | 43.8 | 51.4 | 28.9 | 62.4 | 0.9935 |
Data
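Each file below comes with MD5/SHA-1/SHA-256 checksums. A minimal Python sketch for verifying a download after fetching it (the local file name in the comment is hypothetical):

```python
import hashlib

def checksum(path, algo="md5"):
    """Stream a file through hashlib ("md5", "sha1", or "sha256")."""
    h = hashlib.new(algo)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

# Example (hypothetical local file name): verify the MSVD corpus file.
# assert checksum("msvd_corpus.pkl") == "0161e1d3207f10f7e13f36a10ae81c4f"
```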
MSVD
- MSVD reference file (THUCloud) (GoogleDrive)
  - MD5 9101af3b24c27e0c63b98ac55511d04c
  - SHA-1 3cd3b556c06e8f944f0f01ae8ac03a262dd0af04
- MSVD ResNeXt ECO file (THUCloud) (GoogleDrive)
  - This is the video-level feature extracted by ResNeXt101c64 and the Efficient Convolutional Network (ECO).
  - MD5 b4307b302f8d9754e6e7dac284da0625
  - SHA-1 a34ab8c1b3fc2ecfae43d454ddd3f6eddccbeb1a
- MSVD Semantic tag file (THUCloud) (GoogleDrive)
  - MD5 df5dd440bf6ad78a3266dfbf9018d01e
  - SHA-1 0fb63a3b381c56baf982bbb4fc46027f26a45b02
- MSVD Corpus file (THUCloud) (GoogleDrive)
  - MD5 0161e1d3207f10f7e13f36a10ae81c4f
  - SHA-1 a57c184f80ab2962ffdee71391fc6692c9a42c4b
- MSVD tag ground truth for MSVD tag model (THUCloud) (GoogleDrive)
  - MD5 e2d66ae3f2c28f25071c8ed4591ee9bb
  - SHA-1 351ed0f794ac4ff17ff96a7984a259b1843fa2f0
- MSRVTT tag ground truth for MSVD tag model (THUCloud) (GoogleDrive)
  - The previous two files are used to train the tagging network.
  - MD5 d899e591940926fdbe97dd756a6b1cd8
  - SHA-1 4a8b7c72c2ba9e525ef5cdce27c61e4eff22bb5e
- MSVD tag index2word and word2index mappings (ExternalRepo)
  - We use the same word-index mapping for semantic tags as the code in this link.
- Video name to numerical index mapping file of MSVD (THUCloud) (GoogleDrive)
  - Data type: dict; each key is a video name and the corresponding value is the video index.
  - MD5 ee3ef82df50694db629297fd60fd7427
  - SHA-1 99240d9e91cb6378fded8a6702301e390ffc17fc
- Model Checkpoint:
  - meta (THUCloud) (GoogleDrive)
    - MD5 5d8a0510b20734cf58df9554fd421c50
    - SHA-256 b3ce468e35cab0b409b6de8379fe4db89a439756ca9a33f16e4724b245c3174b
  - index (THUCloud) (GoogleDrive)
    - MD5 0867cdf66639e9f6fc1f5d1c78b7d05e
    - SHA-256 462ca7467593393e28f211fb2554e34f80c23e2e3d8e1f49668530064cb7abae
  - data-00000-of-00001 (THUCloud) (GoogleDrive)
    - MD5 1c3a83f1c9a0a38e30d4ab22af77377d
    - SHA-256 caeaf341f86599c4acef0a67d893f040fa0d287dde8cac7db70bb3e24a10b68f
MSRVTT
- MSRVTT reference file (THUCloud) (GoogleDrive)
  - MD5 2ca68300ab2440ab0f6972ea12a0f323
  - SHA-1 024be7b58fd26c5add388e42210170484f0e86cf
- MSRVTT ResNeXt ECO file (THUCloud) (GoogleDrive)
  - This is the video-level feature extracted by ResNeXt101c64 and the Efficient Convolutional Network (ECO).
  - MD5 d0f05df7d113e4914ab9981d03c7dc70
  - SHA-1 258e3e92462469dbaf97808f2ca1eb8369f0930b
- MSRVTT Semantic tag file (THUCloud) (GoogleDrive)
  - MD5 e41fd8fe8e198a6578c84db273ca8bd9
  - SHA-1 2129dccd7a67a3d56d803e9f4c032da9b7e81742
- MSRVTT Corpus file (THUCloud) (GoogleDrive)
  - MD5 eba8f53dc1fc1f91bc1d434326964366
  - SHA-1 5a63a2d4215b6894cb6d8c9319c133896ece3ea2
- MSVD tag ground truth for MSRVTT tag model (THUCloud) (GoogleDrive)
  - MD5 c9ce9007ef754338d89e40018d14c923
  - SHA-1 03719c1324f9299a510c8c380352b9f4b7125878
- MSRVTT tag ground truth for MSRVTT tag model (THUCloud) (GoogleDrive)
  - The previous two files are used to train the tagging network.
  - MD5 e7d6defa3278bc91ecd7dcbdbda16649
  - SHA-1 2eea2721fcf744d8e42b6ef95c1e2b481534c5aa
- MSRVTT tag word2index and index2word mappings (THUCloud) (GoogleDrive)
  - This file contains the word-to-index and index-to-word mappings.
  - SHA-256 50b00cebb12a38c0c4a546c577c99c12e97c8961b4f2ce9472b77d4ad05e1226
- Model Checkpoint:
  - meta (THUCloud) (GoogleDrive)
    - MD5 cc4e6775bf0eec75b06e6e6fecfd5eb6
    - SHA-256 f0e1cca6d4186756b6a7f062a0a9824357fe1d37dfb984ad09ceda2a7db01fac
  - index (THUCloud) (GoogleDrive)
    - MD5 0da1d37d62ca514c9b31f6c8fced4559
    - SHA-256 ab6f16f94b91d824b8777dda40a67770e0d73b548f556d1f6c169a7862fb3906
  - data-00000-of-00001 (THUCloud) (GoogleDrive)
    - MD5 5f28126cc6a2e38fe7919715cbc4bb7e
    - SHA-256 74ad87644a9380ef8fca0f2beee85e0ba87a9bfe20e05e4e93dfe4d876a7f167
- MSR-VTT Dataset:
  - train_val_test_annotation.zip (THUCloud) (GoogleDrive)
    - SHA-256 ce2d97dd82d03e018c6f9ee69c96eb784397d1c83f734fdb8c17aafa5e27da31
  - msr-vtt-v1.part1.rar (THUCloud) (GoogleDrive)
    - SHA-256 3445e0d1bffda3739110dfcf14182b63222731af8a4d7153f0ac09dbec39a0d3
  - msr-vtt-v1.part2.rar (THUCloud) (GoogleDrive)
    - SHA-256 b550997526272ab68a42f1bd93315aa2bbb521c71f33d0cb922fbbfb86f15aae
  - msr-vtt-v1.part3.rar (THUCloud) (GoogleDrive)
    - SHA-256 debbd0e535e77d9927ffb375299c08990519e22ba7dac542b464b70d440ef515
ECO
- Source Code: GitHub.
- ECO_full_kinetics.caffemodel (THUCloud) (GoogleDrive)
  - MD5 31ed18d5eadfd59cb65b7dcdadc310b4
  - SHA-1 b749384d2dac102b8035965566e3030fce465c20
Citation
@ARTICLE{2019arXiv190900121C,
author = {{Chen}, Haoran and {Lin}, Ke and {Maye}, Alexander and {Li}, Jianming and
{Hu}, Xiaolin},
title = "{A Semantics-Assisted Video Captioning Model Trained with Scheduled Sampling}",
journal = {arXiv e-prints},
keywords = {Computer Science - Computer Vision and Pattern Recognition, Computer Science - Computation and Language},
year = "2019",
month = "Aug",
eid = {arXiv:1909.00121},
pages = {arXiv:1909.00121},
archivePrefix = {arXiv},
eprint = {1909.00121},
primaryClass = {cs.CV},
adsurl = {https://ui.adsabs.harvard.edu/abs/2019arXiv190900121C},
adsnote = {Provided by the SAO/NASA Astrophysics Data System}
}