TVCaption

PyTorch implementation of MultiModal Transformer (MMT), a method for multimodal (video + subtitle) captioning.

TVR: A Large-Scale Dataset for Video-Subtitle Moment Retrieval

Jie Lei, Licheng Yu, Tamara L. Berg, Mohit Bansal

Resources

Getting started

Prerequisites

  1. Clone this repository
git clone --recursive https://github.com/jayleicn/TVCaption.git
cd TVCaption
  2. Prepare feature files. Download tvc_feature_release.tar.gz (23GB), then extract it to the data directory:
tar -xf path/to/tvc_feature_release.tar.gz -C data

You should see video_feature under the data/tvc_feature_release directory. It contains video features (ResNet, I3D, ResNet+I3D); these are the same video features used for TVR/XML. Read the feature extraction code for details on how the features were extracted.
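
Since h5py is a listed dependency, the feature files are presumably stored as HDF5; below is a minimal sketch (not part of the repo) for inspecting one. The file name is a guess, so point it at whichever .h5 file you find under data/tvc_feature_release/video_feature.

import h5py

# Hypothetical file name; replace with an actual .h5 file under
# data/tvc_feature_release/video_feature/.
with h5py.File("data/tvc_feature_release/video_feature/resnet_i3d.h5", "r") as f:
    # Keys are typically clip/video ids; each maps to a (num_frames, feat_dim) array.
    key = next(iter(f.keys()))
    print(key, f[key].shape, f[key].dtype)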

  3. Install dependencies (a quick import check follows this list):
  • Python 2.7
  • PyTorch 1.1.0
  • nltk
  • easydict
  • tqdm
  • h5py
  • tensorboardX
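
A quick sanity check (not part of the repo) to confirm the dependencies import cleanly:

from __future__ import print_function  # keeps the check runnable under Python 2.7 as well

# Import each dependency and print a few versions to confirm the environment is set up.
import torch, nltk, easydict, tqdm, h5py, tensorboardX

print("PyTorch:", torch.__version__)            # expected 1.1.0
print("CUDA available:", torch.cuda.is_available())
print("h5py:", h5py.__version__)
print("nltk:", nltk.__version__)
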
  4. Add project root to PYTHONPATH
source setup.sh

Note that you need to do this each time you start a new session.
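
To verify the step took effect, a minimal check (not part of the repo):

import os, sys

# setup.sh should have put the repository root on PYTHONPATH for this shell session.
print(os.environ.get("PYTHONPATH", "(not set)"))
print([p for p in sys.path if "TVCaption" in p])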

Training and Inference

  1. Build Vocabulary
bash baselines/transformer_captioning/scripts/build_vocab.sh

Running this command builds the vocabulary file cache/tvc_word2idx.json from the TVC train set.
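
As a sketch of what the vocabulary file provides, assuming a plain word-to-index JSON mapping (the "<unk>" token name below is an assumption; check the repo's vocabulary code for the actual special tokens):

import json

# Load the word-to-index mapping produced by build_vocab.sh.
with open("cache/tvc_word2idx.json", "r") as f:
    word2idx = json.load(f)

print("vocab size:", len(word2idx))

# Map a tokenized caption to ids; "<unk>" is an assumed unknown-token name.
caption = "a man walks into the room".split()
ids = [word2idx.get(w, word2idx.get("<unk>", 0)) for w in caption]
print(ids)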

  2. MMT training
bash baselines/multimodal_transformer/scripts/train.sh CTX_MODE VID_FEAT_TYPE

CTX_MODE refers to the context used (video, sub, video_sub). VID_FEAT_TYPE is the video feature type (resnet, i3d, resnet_i3d).

Below is an example of training MMT with both video and subtitle, where we use the concatenation of ResNet and I3D features for video.

bash baselines/multimodal_transformer/scripts/train.sh video_sub resnet_i3d

By default, this code loads all the data (~30GB) into RAM to speed up training; use --no_core_driver to disable this behavior.
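
The flag name suggests the in-RAM loading relies on h5py's "core" driver; a rough illustration of the difference (the file name is hypothetical):

import h5py

path = "data/tvc_feature_release/video_feature/resnet_i3d.h5"  # hypothetical file name

# Default driver: data stays on disk and is read lazily on each access.
feats_on_disk = h5py.File(path, "r")

# Core driver: the whole file is loaded into memory up front (~30GB for all features here),
# trading RAM for faster random reads during training.
feats_in_ram = h5py.File(path, "r", driver="core")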

Training with the above config stops at around epoch 22 and takes about 3 hours on a single 2080Ti GPU. You should get ~45.0 CIDEr-D and ~10.5 BLEU@4 on the val set. The resulting model and config are saved under baselines/multimodal_transformer/results/video_sub-res-*

  3. MMT inference. After training, you can run inference with the saved model on the val or test_public set:
bash baselines/multimodal_transformer/scripts/translate.sh MODEL_DIR_NAME SPLIT_NAME

MODEL_DIR_NAME is the name of the directory containing the saved model, e.g., video_sub-res-*. SPLIT_NAME can be val or test_public.

Evaluation and Submission

See standalone_eval/README.md for instructions on evaluation and submission.

Citations

If you find this code useful for your research, please cite our paper:

@inproceedings{lei2020tvr,
  title={TVR: A Large-Scale Dataset for Video-Subtitle Moment Retrieval},
  author={Lei, Jie and Yu, Licheng and Berg, Tamara L and Bansal, Mohit},
  booktitle={Tech Report},
  year={2020}
}

Acknowledgement

This research is supported by NSF Awards #1633295, 1562098, 1405822, DARPA MCS Grant #N66001-19-2-4031, DARPA KAIROS Grant #FA8750-19-2-1004, a Google Focused Research Award, and ARO-YIP Award #W911NF-18-1-0336.

This code is partly inspired by the following projects: OpenNMT-py, transformers.

Contact

jielei [at] cs.unc.edu