This is the official code release for the R3-Transformer proposed in "Neuro-Symbolic Representations for Video Captioning: A Case for Leveraging Inductive Biases for Vision and Language."
All dependencies are included in the model's container. First, install the latest version of Docker. Then pull our Docker image:
docker pull hassanhub/vid_cap:latest
Then run the container:
docker run --gpus all --name r3_container -it -v /home/
Note: This image already includes CUDA-related drivers and dependencies.
Alternatively, you can create your own environment and make sure the following dependencies are installed:
- Python 3.7/3.8
- TensorFlow 2.3
- CUDA 10.1
- NVIDIA Driver v440.100
- cuDNN 7.6.5
- opencv-python
- h5py
- transformers
- matplotlib
- scikit-image
- nvidia-ml-py3
- decord
- pandas
- tensorcore.dataflow
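If you build your own environment, a quick check like the following can verify that the key packages import and that TensorFlow sees a GPU. This is a minimal sketch; the exact version pins should match the list above.

```python
# Minimal environment sanity check (illustrative; version pins per the list above).
import tensorflow as tf
import h5py, cv2, decord, transformers

print("TensorFlow:", tf.__version__)                      # expected 2.3.x
print("GPUs:", tf.config.list_physical_devices("GPU"))    # should list at least one GPU
print("h5py:", h5py.__version__)
print("transformers:", transformers.__version__)
```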
To speed up data loading, we use a multi-chunk HDF5 format. There are two options for preparing the data for training/evaluation.
Download features pre-extracted with SlowFast-50-8x8 pre-trained on Kinetics-400 from this link (a short example of reading a downloaded chunk is shown after the list):
- Parts 0-10 (coming soon...)
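Once a chunk is downloaded, it can be inspected with h5py. The snippet below is only a minimal sketch: the file name features_part0.h5 and the per-video key/dataset layout are illustrative assumptions, not the exact schema of the released files.

```python
import h5py

# Minimal sketch for inspecting one downloaded chunk.
# File name and internal layout are assumptions for illustration only.
with h5py.File("features_part0.h5", "r") as f:
    video_ids = list(f.keys())
    print("videos in this chunk:", len(video_ids))

    sample = f[video_ids[0]]
    feats = sample["features"][()]   # e.g. (num_clips, feature_dim) SlowFast features
    print(video_ids[0], feats.shape)
```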
Alternatively, you can follow these steps to extract a customized set of features with your own visual backbone:
- Download YouCook II
- Download ActivityNet Captions
- Pre-process raw video files using this script
- Extract visual features with your own backbone or our pre-trained SlowFast-50-8x8 using this script
- Store features and captions in the multi-chunk HDF5 format using this script (a rough sketch of this step is shown below)
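For reference, the sketch below illustrates the idea behind the last step: splitting (video id, features, caption) triplets across several HDF5 files. Dataset names, chunk size, and the caption encoding are assumptions for illustration only; the provided script defines the actual format.

```python
import h5py
import numpy as np

def write_chunks(samples, chunk_size=500, prefix="feats"):
    """Write (video_id, features, caption) samples into several HDF5 chunk files.

    Illustrative sketch only: dataset names, chunk size, and caption encoding
    are assumptions, not the repository's exact schema.
    """
    for start in range(0, len(samples), chunk_size):
        path = f"{prefix}_part{start // chunk_size}.h5"
        with h5py.File(path, "w") as f:
            for video_id, feats, caption in samples[start:start + chunk_size]:
                grp = f.create_group(video_id)
                grp.create_dataset("features", data=feats, compression="gzip")
                grp.create_dataset("caption", data=caption.encode("utf-8"))
        print("wrote", path)

# Toy usage with random arrays standing in for real backbone features.
dummy = [(f"video_{i}", np.random.rand(32, 2304).astype("float32"), "a person cooks")
         for i in range(4)]
write_chunks(dummy, chunk_size=2)
```

Keeping each chunk relatively small is what allows the data pipeline to shard files across workers, which is the motivation for the multi-chunk layout.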