Official JAX implementation of MAGVIT: Masked Generative Video Transformer

Primary language: Python | License: Apache-2.0

MAGVIT: Masked Generative Video Transformer

[Paper] | [Project Page] | [Colab]

Official code and models for the CVPR 2023 paper:

MAGVIT: Masked Generative Video Transformer
Lijun Yu, Yong Cheng, Kihyuk Sohn, José Lezama, Han Zhang, Huiwen Chang, Alexander G. Hauptmann, Ming-Hsuan Yang, Yuan Hao, Irfan Essa, Lu Jiang
CVPR 2023

Summary

We introduce MAGVIT to tackle various video synthesis tasks with a single model, and we demonstrate its quality, efficiency, and flexibility.

If you find this code useful in your research, please cite

@inproceedings{yu2023magvit,
  title={{MAGVIT}: Masked generative video transformer},
  author={Yu, Lijun and Cheng, Yong and Sohn, Kihyuk and Lezama, Jos{\'e} and Zhang, Han and Chang, Huiwen and Hauptmann, Alexander G and Yang, Ming-Hsuan and Hao, Yuan and Essa, Irfan and Jiang, Lu},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year={2023}
}

Disclaimers

Please note that this is not an officially supported Google product.

Checkpoints are based on training with publicly available datasets. Some datasets contain limitations, including non-commercial use limitations. Please review terms and conditions made available by third parties before using models and datasets provided.

Installation

A conda environment file is provided for running on GPUs. JAX requires CUDA 11 and cuDNN 8.6. This VM Image has been tested.

conda env create -f environment.yaml
conda activate magvit

Alternatively, you can install the dependencies via

pip install -r requirements.txt
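
After either install route, it can help to confirm that JAX actually sees the GPU before launching anything expensive. The snippet below is a minimal sketch and not part of this repository: the helper name `jax_backend_report` is hypothetical, and it relies only on the public `jax.default_backend()` and `jax.devices()` calls.

```python
import importlib.util


def jax_backend_report():
    """Return a short description of the JAX backend (hypothetical helper)."""
    # Avoid an ImportError if the environment was not activated.
    if importlib.util.find_spec("jax") is None:
        return "jax is not installed in this environment"
    import jax

    # jax.devices() lists the devices of the default backend;
    # each device exposes a `platform` string such as "cpu" or "gpu".
    platforms = sorted({d.platform for d in jax.devices()})
    return f"default backend: {jax.default_backend()}, platforms: {platforms}"


if __name__ == "__main__":
    print(jax_backend_report())
```

If the report shows only `cpu` on a GPU machine, the CUDA or cuDNN versions are the likely first thing to check.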

Pretrained models

Model weights and loading instructions are coming soon.

MAGVIT 3D-VQ models

| Model | Size | Input | Output | Codebook size | Dataset |
|-------|------|-------|--------|---------------|---------|
| 3D-VQ | B | 16 frames x 64x64 | 4x16x16 | 1024 | BAIR Robot Pushing |
| 3D-VQ | L | 16 frames x 64x64 | 4x16x16 | 1024 | BAIR Robot Pushing |
| 3D-VQ | B | 16 frames x 128x128 | 4x16x16 | 1024 | UCF-101 |
| 3D-VQ | L | 16 frames x 128x128 | 4x16x16 | 1024 | UCF-101 |
| 3D-VQ | B | 16 frames x 128x128 | 4x16x16 | 1024 | Kinetics-600 |
| 3D-VQ | L | 16 frames x 128x128 | 4x16x16 | 1024 | Kinetics-600 |
| 3D-VQ | B | 16 frames x 128x128 | 4x16x16 | 1024 | Something-Something-v2 |
| 3D-VQ | L | 16 frames x 128x128 | 4x16x16 | 1024 | Something-Something-v2 |
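
The shapes in the table above imply a fixed token budget per clip. The following is a small arithmetic sketch, assuming the Output column is a token grid ordered as frames x height x width; the function names are illustrative and do not come from this codebase.

```python
import math

# (input shape, token grid) as (frames, height, width), copied from the table.
# Assumption: the "Output" column is the 3D token grid in the same axis order.
CONFIGS = {
    "BAIR Robot Pushing": ((16, 64, 64), (4, 16, 16)),
    "UCF-101": ((16, 128, 128), (4, 16, 16)),
    "Kinetics-600": ((16, 128, 128), (4, 16, 16)),
    "Something-Something-v2": ((16, 128, 128), (4, 16, 16)),
}
CODEBOOK_SIZE = 1024


def downsample_factors(input_shape, token_shape):
    """Per-axis reduction from pixels/frames to tokens."""
    return tuple(i // t for i, t in zip(input_shape, token_shape))


def tokens_per_clip(token_shape):
    """Total number of discrete tokens that represent one clip."""
    return math.prod(token_shape)


def bits_per_token(codebook_size):
    """Bits needed to index one entry of the codebook."""
    return math.log2(codebook_size)


if __name__ == "__main__":
    for dataset, (inp, tok) in CONFIGS.items():
        print(dataset, downsample_factors(inp, tok), tokens_per_clip(tok))
    print("bits per token:", bits_per_token(CODEBOOK_SIZE))
```

Under this reading, every configuration yields 4 x 16 x 16 = 1024 tokens per 16-frame clip, with a 4x temporal reduction and a 4x (64x64 input) or 8x (128x128 input) spatial reduction.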

MAGVIT transformers

Each transformer model must be used with its corresponding 3D-VQ tokenizer of the same dataset and model size.

| Model | Task | Size | Dataset | FVD |
|-------|------|------|---------|-----|
| Transformer | Class-conditional | B | UCF-101 | 159 |
| Transformer | Class-conditional | L | UCF-101 | 76 |
| Transformer | Frame prediction | B | BAIR Robot Pushing | 76 (48) |
| Transformer | Frame prediction | L | BAIR Robot Pushing | 62 (31) |
| Transformer | Frame prediction (5) | B | Kinetics-600 | 24.5 |
| Transformer | Frame prediction (5) | L | Kinetics-600 | 9.9 |
| Transformer | Multi-task-8 | B | BAIR Robot Pushing | 32.8 |
| Transformer | Multi-task-8 | L | BAIR Robot Pushing | 22.8 |
| Transformer | Multi-task-10 | B | Something-Something-v2 | 43.4 |
| Transformer | Multi-task-10 | L | Something-Something-v2 | 27.3 |
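
Because each transformer only works with the tokenizer of the same dataset and model size, a cheap guard before loading weights can catch mismatches early. This is a hypothetical sketch: the `Checkpoint` record and `check_pairing` function are invented here for illustration and are not this repository's loading API, which has not been released yet.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Checkpoint:
    """Hypothetical description of a released model, mirroring the table columns."""
    kind: str     # "3D-VQ" or "Transformer"
    size: str     # "B" or "L"
    dataset: str  # e.g. "UCF-101"


def check_pairing(tokenizer, transformer):
    """Raise ValueError unless the tokenizer and transformer belong together."""
    if tokenizer.kind != "3D-VQ":
        raise ValueError(f"expected a 3D-VQ tokenizer, got {tokenizer.kind!r}")
    if transformer.kind != "Transformer":
        raise ValueError(f"expected a Transformer, got {transformer.kind!r}")
    # The pairing rule: same dataset and same model size.
    if (tokenizer.size, tokenizer.dataset) != (transformer.size, transformer.dataset):
        raise ValueError(
            f"tokenizer ({tokenizer.size}, {tokenizer.dataset}) does not match "
            f"transformer ({transformer.size}, {transformer.dataset})"
        )


if __name__ == "__main__":
    vq = Checkpoint("3D-VQ", "L", "UCF-101")
    check_pairing(vq, Checkpoint("Transformer", "L", "UCF-101"))  # passes silently
```

Pairing a size-L tokenizer with a size-B transformer, or mixing datasets, raises `ValueError` under this sketch.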