video2text.pytorch

PyTorch implementation of video captioning


Requirements

Pretrained Model

Datasets

Obtain the dataset you need (for example, MSVD, which the steps below assume).

Packages

torch, torchvision, numpy, scikit-image, nltk, h5py, pandas, future  # future: needed for Python 2 only
tensorboard_logger  # for viewing training loss in TensorBoard

You can install all of the packages above with:

sudo pip install -r requirements.txt
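For reference, a requirements.txt consistent with the package list above might look like this (no version pins are specified here, so none are shown):

torch
torchvision
numpy
scikit-image
nltk
h5py
pandas
future
tensorboard_logger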

Usage

Preparing Data

First, create soft links to the dataset folder and the pretrained models. For example:

mkdir datasets
ln -s YOUR_DATASET_PATH datasets/MSVD
mkdir models
ln -s YOUR_CNN_MODEL_PATH models/

More details can be found in opts.py. Then run:

  1. Prepare video features (a rough sketch of this step follows the list):
python scripts/prepro_video_feats.py
  2. Prepare caption features and the dataset split:
python scripts/prepro_caption_feats.py
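The exact preprocessing is defined by the two scripts above. As a rough, hypothetical sketch of what step 1 typically involves (sampling frames, encoding them with a pretrained CNN, and writing features to an HDF5 file; the frame directory layout, file paths, and helper names below are assumptions, not this repo's actual code):

# Hypothetical sketch of CNN-based video feature extraction; the real logic
# lives in scripts/prepro_video_feats.py and may differ in its details.
from pathlib import Path

import h5py
import torch
import torchvision.models as models
import torchvision.transforms as transforms
from PIL import Image

# Pretrained CNN with the classification head removed, used as a frame encoder.
cnn = models.resnet152(pretrained=True)
cnn.fc = torch.nn.Identity()  # keep the 2048-d pooled features
cnn.eval()

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

def encode_frames(frames):
    # Encode a list of PIL frames into a (num_frames, 2048) array.
    batch = torch.stack([preprocess(f) for f in frames])
    with torch.no_grad():
        return cnn(batch).numpy()

# Assumed layout: datasets/MSVD/frames/<video_id>/*.jpg (frames pre-extracted).
frame_root = Path('datasets/MSVD/frames')
with h5py.File('datasets/MSVD/video_feats.h5', 'w') as out:
    for video_dir in sorted(frame_root.iterdir()):
        frames = [Image.open(p).convert('RGB')
                  for p in sorted(video_dir.glob('*.jpg'))]
        if frames:
            out.create_dataset(video_dir.name, data=encode_frames(frames))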

Training and Testing

Before training the model, make sure PyTorch can use a GPU to accelerate computation. Parameters such as batch size and learning rate can be found in args.py. A simplified sketch of a single training step appears after the command list below.
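A quick way to confirm that PyTorch can see a GPU:

python -c "import torch; print(torch.cuda.is_available())"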

  • Train:
python train.py
  • Evaluate:
python evaluate.py
  • Sample some examples:
python sample.py
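The actual model, loss, and options live in train.py and the files it imports. As a simplified, hypothetical sketch of what one training step of an encoder-decoder video captioner looks like in PyTorch (all module names, dimensions, and the random stand-in data below are assumptions, not this repo's API):

# Minimal sketch of one training step for an encoder-decoder captioner.
# All names are illustrative; the real model and loss live in this repo.
import torch
import torch.nn as nn

class CaptionDecoder(nn.Module):
    # Toy LSTM decoder conditioned on mean-pooled video features.
    def __init__(self, feat_dim=2048, embed_dim=256,
                 hidden_dim=512, vocab_size=10000):
        super().__init__()
        self.init_h = nn.Linear(feat_dim, hidden_dim)
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, video_feats, captions):
        # video_feats: (B, T, feat_dim); captions: (B, L) token ids.
        h0 = self.init_h(video_feats.mean(dim=1)).unsqueeze(0)  # (1, B, H)
        c0 = torch.zeros_like(h0)
        emb = self.embed(captions[:, :-1])  # teacher forcing: shifted input
        hidden, _ = self.lstm(emb, (h0, c0))
        return self.out(hidden)  # (B, L-1, vocab_size)

device = 'cuda' if torch.cuda.is_available() else 'cpu'
model = CaptionDecoder().to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()

# One step on random stand-in data; a real run uses the prepared features.
feats = torch.randn(8, 20, 2048, device=device)         # 8 videos, 20 frames
caps = torch.randint(0, 10000, (8, 15), device=device)  # 8 captions, 15 tokens

logits = model(feats, caps)
loss = criterion(logits.reshape(-1, logits.size(-1)), caps[:, 1:].reshape(-1))
optimizer.zero_grad()
loss.backward()
optimizer.step()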

Related papers

1. Supervising Neural Attention Models for Video Captioning by Human Gaze Data