MAGVIT: Masked Generative Video Transformer

Official code and models for the CVPR 2023 paper:

MAGVIT: Masked Generative Video Transformer
Lijun Yu, Yong Cheng, Kihyuk Sohn, José Lezama, Han Zhang, Huiwen Chang, Alexander G. Hauptmann, Ming-Hsuan Yang, Yuan Hao, Irfan Essa, Lu Jiang
CVPR 2023

Summary

We introduce MAGVIT, a single model that tackles various video synthesis tasks, and demonstrate its quality, efficiency, and flexibility.

If you find this code useful in your research, please cite:

@inproceedings{yu2023magvit,
  title={{MAGVIT}: Masked generative video transformer},
  author={Yu, Lijun and Cheng, Yong and Sohn, Kihyuk and Lezama, Jos{\'e} and Zhang, Han and Chang, Huiwen and Hauptmann, Alexander G and Yang, Ming-Hsuan and Hao, Yuan and Essa, Irfan and Jiang, Lu},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year={2023}
}

Disclaimers

Please note that this is not an officially supported Google product.

Checkpoints are based on training with publicly available datasets. Some datasets contain limitations, including non-commercial use limitations. Please review terms and conditions made available by third parties before using models and datasets provided.

Installation

A conda environment file is provided for running on GPUs. JAX requires CUDA 11 and cuDNN 8.6. This VM Image has been tested.

conda env create -f environment.yaml
conda activate magvit

Alternatively, you can install the dependencies via

pip install -r requirements.txt
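
After either install route, it can help to confirm that JAX imports and sees a GPU before running anything heavier. The following is a minimal sketch and not part of this repository: the helper name `jax_platforms` is hypothetical, and the only JAX call it relies on is `jax.devices()`. It returns `None` when JAX is not installed rather than failing.

```python
import importlib.util


def jax_platforms():
    """Return the platform of each device JAX can see, or None if JAX is missing.

    Hypothetical helper for illustration; not part of the MAGVIT code.
    """
    if importlib.util.find_spec("jax") is None:
        return None
    import jax  # imported lazily so the script still runs without JAX

    return [device.platform for device in jax.devices()]


if __name__ == "__main__":
    platforms = jax_platforms()
    if platforms is None:
        print("JAX is not installed in this environment.")
    else:
        # Accept both spellings in case the platform string differs by version.
        gpus = [p for p in platforms if p in ("gpu", "cuda")]
        if gpus:
            print(f"JAX sees {len(gpus)} GPU device(s).")
        else:
            print(f"JAX is running on {platforms}; check the CUDA and cuDNN setup.")
```

If only CPU devices are listed, the CUDA and cuDNN versions are the first thing to check.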

Pretrained models

Model weights and loading instructions are coming soon.

MAGVIT 3D-VQ models

Model	Size	Input	Output	Codebook size	Dataset
3D-VQ	B	16 frames x 64x64	4x16x16	1024	BAIR Robot Pushing
3D-VQ	L	16 frames x 64x64	4x16x16	1024	BAIR Robot Pushing
3D-VQ	B	16 frames x 128x128	4x16x16	1024	UCF-101
3D-VQ	L	16 frames x 128x128	4x16x16	1024	UCF-101
3D-VQ	B	16 frames x 128x128	4x16x16	1024	Kinetics-600
3D-VQ	L	16 frames x 128x128	4x16x16	1024	Kinetics-600
3D-VQ	B	16 frames x 128x128	4x16x16	1024	Something-Something-v2
3D-VQ	L	16 frames x 128x128	4x16x16	1024	Something-Something-v2
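
The Input and Output columns imply a fixed compression ratio per dataset. The sketch below works through that arithmetic for two rows of the table. It assumes the Output column is a token grid ordered as time x height x width and that each token is one index into the codebook; the function name `tokenizer_stats` is hypothetical and not part of this repository.

```python
import math


def tokenizer_stats(frames, height, width, token_grid, codebook_size):
    """Compute downsampling factors and the token budget for one table row.

    token_grid is assumed to be (time, height, width) in tokens.
    """
    t, h, w = token_grid
    num_tokens = t * h * w
    return {
        "temporal_factor": frames // t,
        "spatial_factor": (height // h, width // w),
        "num_tokens": num_tokens,
        # Bits needed to index one codebook entry.
        "bits_per_token": math.log2(codebook_size),
        # Pixel positions (per channel) summarized by each token.
        "pixels_per_token": (frames * height * width) // num_tokens,
    }


# BAIR Robot Pushing row: 16 frames x 64x64 -> 4x16x16
bair = tokenizer_stats(16, 64, 64, (4, 16, 16), 1024)
# UCF-101 row: 16 frames x 128x128 -> 4x16x16
ucf = tokenizer_stats(16, 128, 128, (4, 16, 16), 1024)

print(bair)
print(ucf)
```

Under these assumptions every row yields 1024 tokens per clip at 10 bits each. The rows differ only in spatial downsampling: 4x for the 64x64 BAIR inputs and 8x for the 128x128 inputs, with 4x temporal downsampling in both cases.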

MAGVIT transformers

Each transformer model must be used with its corresponding 3D-VQ tokenizer of the same dataset and model size.

Model	Task	Size	Dataset	FVD
Transformer	Class-conditional	B	UCF-101	159
Transformer	Class-conditional	L	UCF-101	76
Transformer	Frame prediction	B	BAIR Robot Pushing	76 (48)
Transformer	Frame prediction	L	BAIR Robot Pushing	62 (31)
Transformer	Frame prediction (5)	B	Kinetics-600	24.5
Transformer	Frame prediction (5)	L	Kinetics-600	9.9
Transformer	Multi-task-8	B	BAIR Robot Pushing	32.8
Transformer	Multi-task-8	L	BAIR Robot Pushing	22.8
Transformer	Multi-task-10	B	Something-Something-v2	43.4
Transformer	Multi-task-10	L	Something-Something-v2	27.3
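
The pairing rule stated above the table (same dataset and same model size for the transformer and its 3D-VQ tokenizer) is easy to enforce with a small guard before loading anything. Since loading instructions are not yet published, the sketch below is purely illustrative: `ModelSpec` and `check_pairing` are hypothetical names, not an API of this repository.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelSpec:
    """Hypothetical description of one row in the tables above."""
    kind: str     # "3D-VQ" or "Transformer"
    size: str     # "B" or "L"
    dataset: str  # e.g. "UCF-101"


def check_pairing(tokenizer: ModelSpec, transformer: ModelSpec) -> None:
    """Raise ValueError unless the two specs share dataset and model size."""
    if tokenizer.kind != "3D-VQ" or transformer.kind != "Transformer":
        raise ValueError("Expected a 3D-VQ tokenizer and a Transformer.")
    if tokenizer.dataset != transformer.dataset:
        raise ValueError(
            f"Dataset mismatch: {tokenizer.dataset} vs {transformer.dataset}"
        )
    if tokenizer.size != transformer.size:
        raise ValueError(f"Size mismatch: {tokenizer.size} vs {transformer.size}")


# A valid pair: both L, both UCF-101.
check_pairing(
    ModelSpec("3D-VQ", "L", "UCF-101"),
    ModelSpec("Transformer", "L", "UCF-101"),
)

# An invalid pair: B tokenizer with L transformer.
try:
    check_pairing(
        ModelSpec("3D-VQ", "B", "UCF-101"),
        ModelSpec("Transformer", "L", "UCF-101"),
    )
except ValueError as err:
    print(err)
```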