- VGG16 pretrained on ImageNet [PyTorch version]: https://download.pytorch.org/models/vgg16-397923af.pth
- ResNet-101 pretrained on ImageNet [PyTorch version]: https://github.com/ruotianluo/pytorch-resnet
- MSVD: https://www.microsoft.com/en-us/download/details.aspx?id=52422
- MSR-VTT: http://ms-multimedia-challenge.com/2017/dataset
Obtain the dataset you need:
- MSR-VTT: train_val_videos.zip, train_val_annotation.zip, test_videos.zip, test_videodatainfo.json
- Flickr30k: flickr30k.tar.gz, flickr30k-images.tar
torch, torchvision, numpy, scikit-image, nltk, h5py, pandas, future # future is needed for Python 2 only
tensorboard_logger # for viewing training loss in TensorBoard
Install all of the above with:
sudo pip install -r requirements.txt
First, create soft links to the dataset folders and the pretrained models. For example:
mkdir datasets
ln -s YOUR_DATASET_PATH datasets/MSVD
mkdir models
ln -s YOUR_CNN_MODEL_PATH models/
Further details can be found in opts.py. Then prepare the features:
- Prepare video features:
python scripts/prepro_video_feats.py
- Prepare caption features and the dataset split:
python scripts/prepro_caption_feats.py
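Video feature pipelines like this one typically sample a fixed number of frames per clip before running the CNN, so every video yields a fixed-length feature tensor. The sketch below shows a common uniform-sampling scheme; the function name and the default sample count are illustrative and not taken from the repo's scripts:

```python
def uniform_frame_indices(num_frames, num_samples=28):
    """Pick num_samples frame indices spread evenly across a video.

    When the video has fewer frames than requested, indices repeat so the
    output length stays fixed. 28 is a hypothetical default; the actual
    value used by prepro_video_feats.py may differ.
    """
    if num_frames <= 0:
        return []
    step = num_frames / float(num_samples)
    # Clamp to the last valid frame index to guard against rounding overshoot.
    return [min(int(i * step), num_frames - 1) for i in range(num_samples)]

# Example: a 100-frame video sampled down to 10 indices.
print(uniform_frame_indices(100, 10))  # -> [0, 10, 20, 30, 40, 50, 60, 70, 80, 90]
```

The sampled frames would then be passed through the pretrained VGG16/ResNet-101 models linked above to produce the per-video features.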
Before training the model, make sure PyTorch can use a GPU to accelerate computation. Parameters such as batch size and learning rate can be found in args.py.
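A quick way to confirm GPU availability before launching training (a minimal standalone check, not part of the repo's scripts):

```python
import torch

def pick_device():
    # Prefer a CUDA device; fall back to CPU (training will be far slower).
    if torch.cuda.is_available():
        return torch.device("cuda")
    return torch.device("cpu")

print("Training will run on:", pick_device())
```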
- Train:
python train.py
- Evaluate:
python evaluate.py
- Sample some examples:
python sample.py