
MakeLongVideo - PyTorch

Implementation of long video generation based on diffusion models.

"Ironman is surfing" "a car is racing" "a cat eating food of a bowl, in von Gogh style" "a giraffe underneath the microwave"
"a glass bead falling into water with huge splash" "a video of Earth rotating in space" "A teddy bear running in New York City" "A stunning aerial drone footage time lapse of El Capitan in Yosemite National Park at sunset"

Change Logs

  • [07/23/2023] LAION400M did not help much, so I collected another 100M video/text pairs in addition to the 2M WebVid dataset; part of them are watermark-free. After 2~3 months of training, the results look decent. I will release a watermark-free checkpoint soon. Training a video generation model on two RTX 3090 GPUs is really a pain.

Setup

Requirements

python3 -m pip install -r requirements.txt

Training

Prepare Stable Diffusion v1-4 pretrained weights

Download the weights from Hugging Face and put them in the 'checkpoints' directory, which is configured in configs/makelongvideo.yaml.
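
For example, a minimal download sketch using huggingface_hub; the CompVis/stable-diffusion-v1-4 repo id and the target subdirectory are assumptions, so match whatever path configs/makelongvideo.yaml actually expects:

from huggingface_hub import snapshot_download

# Fetch the Stable Diffusion v1-4 weights into the configured directory
# (the exact subdirectory layout is an assumption).
snapshot_download(
    repo_id="CompVis/stable-diffusion-v1-4",
    local_dir="checkpoints/stable-diffusion-v1-4",
)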

Download webvid dataset

Download the WebVid dataset into the 'data/webvid' directory using the https://github.com/m-bain/webvid repo, then prepare the dataset with:

python3 genvideocap.py

Download LAION400M dataset

Download LAION400M into the 'data/laion400m' directory.
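
A common route is the img2dataset tool; below is a hedged sketch of its Python API, assuming the LAION400M metadata parquet files are already on disk (the image size and worker counts are placeholder assumptions; check the img2dataset docs for the exact arguments):

from img2dataset import download

# Convert LAION400M metadata (parquet with URL/TEXT columns) into
# image/text shards under data/laion400m.
download(
    url_list="laion400m-meta",        # assumed directory of metadata parquet files
    input_format="parquet",
    url_col="URL",
    caption_col="TEXT",
    output_format="webdataset",
    output_folder="data/laion400m",
    image_size=256,
    processes_count=16,
    thread_count=64,
)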

Train

First, train at 128x128 resolution:

accelerate launch --config_file ./configs/multigpu.yaml train.py --config configs/makelongvideo.yaml

Then fine-tune at 256x256 resolution (modify the last line of configs/makelongvideo256x256.yaml to point at your local epoch checkpoint):

accelerate launch --config_file ./configs/multigpu.yaml train.py --config configs/makelongvideo256x256.yaml

Inference

Pretrained weights: https://huggingface.co/xiexiecn/MakeLongVideo
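
The released weights can be fetched the same way as the base model; a minimal sketch using huggingface_hub (the local directory is an assumption, so point infer.py at wherever you place the weights):

from huggingface_hub import snapshot_download

# Download the released MakeLongVideo checkpoint (target directory assumed).
snapshot_download(repo_id="xiexiecn/MakeLongVideo", local_dir="checkpoints/MakeLongVideo")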

# unwrap checkpoint first
TORCH_DISTRIBUTED_DEBUG=DETAIL accelerate launch train.py --config configs/makelongvideo.yaml --unwrap ./outputs/makelongvideo/checkpoint-5200
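
Conceptually, the unwrap step turns an accelerate training checkpoint back into plain model weights. A rough sketch of the idea using Accelerate's API, where build_model is a hypothetical stand-in for however train.py constructs the model:

from accelerate import Accelerator

accelerator = Accelerator()
model = build_model()                        # hypothetical: build the model as train.py does
model = accelerator.prepare(model)           # wrap for (distributed) training
accelerator.load_state("./outputs/makelongvideo/checkpoint-5200")
unwrapped = accelerator.unwrap_model(model)  # strip the distributed wrapper
accelerator.save(unwrapped.state_dict(), "pytorch_model.bin")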

Inference directly:

python3 infer.py  --width 256 --height 256 --prompt "a panda is surfing"

Inference using latents initialized from a sample video:

python3 infer.py  --width 256 --height 256 --prompt "a panda is surfing" --sample_video_path your_sample_video

Inference with a sample frame rate of 6 (with a 24 fps source, the actual frame rate is 24/6 == 4):

python3 infer.py  --width 256 --height 256 --prompt "a panda is surfing" --speed 6

Todo

  • generate 24-frame videos at 256x256
  • add fps control
  • release pretrained checkpoint
  • remove watermark
  • increase resolution to 512x512
  • 1~2 minute video generation
  • make story videos

Citations

@misc{singer2022makeavideo,
    title   = {Make-A-Video: Text-to-Video Generation without Text-Video Data},
    author  = {Uriel Singer and others},
    url     = {https://makeavideo.studio/Make-A-Video.pdf},
    year    = {2022}
}
@article{wu2022tuneavideo,
    title   = {Tune-A-Video: One-Shot Tuning of Image Diffusion Models for Text-to-Video Generation},
    author  = {Wu, Jay Zhangjie and Ge, Yixiao and Wang, Xintao and Lei, Stan Weixian and Gu, Yuchao and Hsu, Wynne and Shan, Ying and Qie, Xiaohu and Shou, Mike Zheng},
    journal = {arXiv preprint arXiv:2212.11565},
    year    = {2022},
    note    = {under review}
}