The PyTorch implementation of "Video-Text Pre-training with Learned Regions" (arXiv: https://arxiv.org/abs/2112.01194)
We are still cleaning up the code and preparing the pre-trained weights.
Overall, this code is built on PyTorch with DistributedDataParallel (DDP).
- Create a conda env and install the required packages:

```bash
sh setup_myEnv.sh
```
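For reference, a rough sketch of what such a setup script typically does (the env name, Python version, and package list here are assumptions; the authoritative list lives in setup_myEnv.sh):

```bash
# Sketch only: env name, Python version, and packages are assumptions.
conda create -n myEnv python=3.7 -y
source activate myEnv
pip install torch torchvision   # plus the remaining dependencies from setup_myEnv.sh
```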
- Create some important folders:

```bash
mkdir data       # you can symlink huge datasets to this folder
mkdir meta_data  # put the metadata of each dataset here
mkdir results
```
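For example, to symlink a dataset stored elsewhere instead of copying it (both paths below are illustrative):

```bash
# Illustrative paths; point the link at wherever the dataset actually lives.
ln -s /mnt/storage/WebVid-2M data/WebVid-2M
```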
- Download the pre-training data:
  - WebVid-2M (see https://github.com/m-bain/webvid)
  - CC3M (see https://ai.google.com/research/ConceptualCaptions/download)
PS: Not all videos are still available, so you may need to modify the metadata depending on your case (a filtering sketch is given below). We also provide our metadata here.
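A minimal sketch of such filtering, assuming a CSV whose first column is the relative video path (the file names and directory layout are assumptions; adapt them to the actual metadata schema):

```bash
# Keep only metadata rows whose video file actually exists on disk.
# All paths below are assumptions; adjust them to your metadata schema.
# Note: a header row, if present, is dropped since its "path" does not exist.
while IFS= read -r line; do
  path=${line%%,*}                       # first CSV column: relative video path
  [ -f "data/WebVid-2M/videos/$path" ] && printf '%s\n' "$line"
done < meta_data/webvid_train.csv > meta_data/webvid_train_available.csv
```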
- Run:

```bash
sh pre-training.sh
```

(Commands with different settings are listed in this script.)
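Since the code runs with DDP, the script presumably wraps a distributed launch along these lines (a sketch only: the entry point, flag values, and config path are assumptions; the real commands are in pre-training.sh):

```bash
# Sketch of a single-node DDP launch; entry point and config path are assumptions.
python -m torch.distributed.launch --nproc_per_node=8 \
    train.py --config configs/pretrain.json
```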
- Download the fine-tuning data (see https://github.com/m-bain/frozen-in-time#-finetuning-benchmarks-msr-vtt)
- Run:

```bash
sh fine-tune.sh
```
This code is based on Frozen in Time (https://github.com/m-bain/frozen-in-time). If you find our work useful, please cite:
```bibtex
@article{yan2021video,
  title={Video-Text Pre-training with Learned Regions},
  author={Yan, Rui and Shou, Mike Zheng and Ge, Yixiao and Wang, Alex Jinpeng and Lin, Xudong and Cai, Guanyu and Tang, Jinhui},
  journal={arXiv preprint arXiv:2112.01194},
  year={2021}
}
```