In this work, we present the Patch-based Object-centric Video Transformer (POVT), a novel region-based video generation architecture that leverages object-centric information to efficiently model temporal dynamics in videos. We build on prior work in video prediction that uses an autoregressive transformer over the discrete latent space of compressed videos, adding a modification to model object-centric information via bounding boxes. Because object-centric representations are more compressible, we improve training efficiency by allowing the model to access only object information over longer temporal horizons. When evaluated on several difficult object-centric datasets, our method matches or outperforms other video generation models while remaining more computationally efficient and scalable. In addition, we show that our method enables object-centric controllability through bounding box manipulation.
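As a rough, hedged sketch of this idea (not the repo's actual model code), the snippet below interleaves bounding-box tokens with discrete patch tokens and feeds them to a causally masked transformer; all class names, dimensions, and the token layout are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ObjectCentricVideoTransformer(nn.Module):
    """Illustrative sketch only: an autoregressive transformer over discrete VQ patch
    codes, prefixed with bounding-box tokens. Not POVT's actual implementation."""
    def __init__(self, codebook_size=1024, d_model=512, n_heads=8, n_layers=8, max_len=1024):
        super().__init__()
        self.patch_embed = nn.Embedding(codebook_size, d_model)  # VQ code index -> embedding
        self.bbox_embed = nn.Linear(4, d_model)                  # (x, y, w, h) -> embedding
        self.pos_embed = nn.Parameter(torch.zeros(1, max_len, d_model))
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, codebook_size)            # next-code logits

    def forward(self, patch_codes, bboxes):
        # patch_codes: (B, T_patch) long tensor of VQ indices; bboxes: (B, T_obj, 4)
        tokens = torch.cat([self.bbox_embed(bboxes), self.patch_embed(patch_codes)], dim=1)
        tokens = tokens + self.pos_embed[:, : tokens.size(1)]
        # Causal mask so each position only attends to earlier tokens.
        seq_len = tokens.size(1)
        mask = torch.triu(
            torch.full((seq_len, seq_len), float("-inf"), device=tokens.device), diagonal=1
        )
        hidden = self.transformer(tokens, mask=mask)
        return self.head(hidden[:, bboxes.size(1):])             # predict only the patch codes
```

The point this sketch tries to reflect is that bounding-box tokens are far fewer than patch tokens, so attending to object information over long horizons is cheap relative to attending to full frames.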
```bash
conda create -n povt python=3.7
conda activate povt
conda install --yes pytorch=1.10.1 torchvision=0.11.2 torchaudio=0.10.1 cudatoolkit=11.3 -c pytorch
pip install -r requirements.txt
pip install -e .
```
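After installation, a quick generic check (not part of the repo) confirms that the pinned PyTorch build sees your GPUs:

```python
import torch

print(torch.__version__)          # expect 1.10.1
print(torch.cuda.is_available())  # True if the CUDA 11.3 toolkit is set up correctly
print(torch.cuda.device_count())  # number of visible GPUs
```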
You can download each dataset from the following links (you may need to `pip install internetarchive`).
An example dataset video can be visualized with `scripts/visualize_dataset.py`.
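If you instead want to inspect a converted HDF5 file directly, a minimal h5py snippet works; the group and dataset names written by the conversion scripts are not assumed here, so the code simply lists whatever is present.

```python
import h5py

# Generic inspection of a converted dataset file; adjust the path to your .hdf5.
with h5py.File('/PATH/TO/dataset.hdf5', 'r') as f:
    def show(name, obj):
        if isinstance(obj, h5py.Dataset):
            print(name, obj.shape, obj.dtype)
    f.visititems(show)  # print every dataset with its shape and dtype
```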
Download CATER (with masks) from the DeepMind Multi-Object Dataset.
To convert it to HDF5 format, run `python data_gen/cater_cam/cater_cam_to_hdf5.py /PATH/TO/DATASET`.
Then, to compute bounding boxes, run `python data_gen/cater_cam/compute_cater_cam_bboxes.py -d data/cater_cam.hdf5`.
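For intuition, extracting a bounding box from a per-object binary segmentation mask comes down to a few lines of numpy; this is a generic sketch, not the exact logic of `compute_cater_cam_bboxes.py`.

```python
import numpy as np

def mask_to_bbox(mask):
    """Return (x_min, y_min, x_max, y_max) for a binary object mask, or None if empty."""
    ys, xs = np.nonzero(mask)
    if len(xs) == 0:
        return None  # object not visible in this frame
    return int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())

# Example: a 4x5 mask containing a 2x2 object
mask = np.zeros((4, 5), dtype=bool)
mask[1:3, 2:4] = True
print(mask_to_bbox(mask))  # (2, 1, 3, 2)
```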
We use a modified version of the original CATER repo to render segmentation masks. You will need to first download Blender 2.79b and install the appropriate Python libraries, or use their provided Singularity container. Run `cd CATER/generate` and generate samples with `python launch.py --num_jobs 4` (the total number of jobs will be # of GPUs * `num_jobs`). You may need to set `PYTHONPATH` to the `generate` directory. The dataset will be saved to `CATER/generate/datasets`.
To convert it to HDF5 format, run `python data_gen/cater_mov_to_hdf5.py -d /PATH/TO/cater_mov` to produce `cater_mov.hdf5`.
We use a modified version of the original SORT repo. Please follow the steps below:
- Download Something-Something V2 found here into the folder `povt/data/something-something`.
- `cd povt/data/something-something` and extract the data files with `cat 20bn-something-something-v2-?? | tar zx`.
- Download each bounding box annotation file found here, and place the `.json` files into `povt/data/something-something/annotations`.
- `cd povt` and run `python data_gen/ssv2/read_bboxes.py data/something-something/`.
- `cd sort` and run `python bbox_to_txt.py /PATH/TO/something-something`. You may need to first install SORT dependencies from `sort/requirements.txt`. Then run `python sort.py --seq_path /PATH/TO/something-something/`.
- Move all generated `.txt` files (see the parsing sketch after this list) to replace the ones in `/PATH/TO/something-something/20bn-something-something-v2` with the command `find output/ -name '*.txt' -exec mv {} /PATH/TO/something-something/20bn-something-something-v2 \;`.
- `cd povt` and run `python data_gen/ssv2/filter.py`.
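The tracking step above writes per-video `.txt` files. Assuming they follow the MOT-style rows of the original SORT repo (frame, track id, x, y, w, h, score, ...), they can be parsed roughly as below; this is a hedged sketch, and the modified SORT in this repo may write a different layout.

```python
import csv
from collections import defaultdict

def load_tracks(txt_path):
    """Parse a SORT output file, assuming MOT-style comma-separated rows:
    frame, track_id, x, y, w, h, score, ... (the modified repo may differ)."""
    tracks = defaultdict(list)  # track_id -> list of (frame, x, y, w, h)
    with open(txt_path) as f:
        for row in csv.reader(f):
            frame, track_id = int(row[0]), int(float(row[1]))
            x, y, w, h = map(float, row[2:6])
            tracks[track_id].append((frame, x, y, w, h))
    return tracks
```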
To train a VQ-VAE with 2 GPUs, run `CUDA_VISIBLE_DEVICES=0,1 python scripts/vqvae/run.py -o <exp_name> -c configs/base_vqvae.yaml`.
Reconstructions can be visualized with `scripts/visualize_reconstructions.py`.
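For context, the "discrete latent space" used by POVT comes from vector quantization: each continuous encoder output is snapped to its nearest codebook vector and represented by that vector's index. The snippet below is a generic sketch of that step, not the repo's VQ-VAE code.

```python
import torch

def quantize(latents, codebook):
    """Nearest-neighbor vector quantization (generic sketch, not the repo's VQ-VAE).
    latents: (N, D) continuous encoder outputs; codebook: (K, D) embedding vectors."""
    dists = torch.cdist(latents, codebook)   # (N, K) pairwise L2 distances
    codes = dists.argmin(dim=1)              # index of the nearest codebook vector
    return codes, codebook[codes]            # discrete codes and quantized latents
```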
Make sure to fill in `configs/base_povt.yaml` with the correct VQ-VAE checkpoint. To train POVT with 4 GPUs, run `CUDA_VISIBLE_DEVICES=0,1,2,3 python scripts/gpt/run.py -o <exp_name> -c configs/base_povt.yaml`.
Samples can be generated with `scripts/sample.py`. Note that the default code performs single-frame conditional video prediction.
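Conceptually, single-frame conditional prediction means the sampler is given the discrete codes of the first frame and autoregressively samples the codes of the remaining frames, which are then decoded by the VQ-VAE. The loop below is only a schematic of that procedure, with a hypothetical model interface rather than the actual `scripts/sample.py` API.

```python
import torch

@torch.no_grad()
def sample_video(model, first_frame_codes, num_future_tokens, temperature=1.0):
    """Schematic autoregressive sampling: condition on the first frame's VQ codes,
    then sample one future code at a time. `model` returning next-token logits of
    shape (B, T, codebook_size) is a hypothetical interface, not the repo's sampler."""
    codes = first_frame_codes.clone()                # (B, T0) codes of the conditioning frame
    for _ in range(num_future_tokens):
        logits = model(codes)[:, -1] / temperature   # logits for the next code
        probs = torch.softmax(logits, dim=-1)
        next_code = torch.multinomial(probs, num_samples=1)
        codes = torch.cat([codes, next_code], dim=1)
    return codes                                     # decode with the VQ-VAE to get frames
```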