This project aims to create a simple and scalable repo, to reproduce Sora (OpenAI, but we prefer to call it "CloseAI" ) and build knowledge about Video-VQVAE (VideoGPT) + DiT at scale. However, we have limited resources, we deeply wish all open-source community can contribute to this project. Pull request are welcome!!!
本项目希望通过开源社区的力量复现Sora,由北大-兔展AIGC联合实验室共同发起,当前我们资源有限仅搭建了基础架构,无法进行完整训练,希望通过开源社区逐步增加模块并筹集资源进行训练,当前版本离目标差距巨大,仍需持续完善和快速迭代,欢迎Pull request!!!
[2024.03.04] We re-organize and modulize our codes and make it easy to contribute to the project, please see the Repo structure.
[2024.03.03] We open some discussions and clarify several issues.
[2024.03.01] Training codes are available now! Learn more in our project page. Please feel free to watch 👀 this repository for the latest updates.
-
support variable aspect ratios, resolutions, durations training on DiT
-
dynamic mask input
-
add class-conditioning on embeddings
-
sampling script
-
add positional interpolation
-
fine-tune Video-VQVAE on higher resolution
-
incorporating SiT
-
incorporating more conditions
-
training with more data and more GPU
├── README.md
├── docs
│ ├── data.md -> Datasets description.
├── scripts -> All training scripts.
│ └── train.sh
├── sora
│ ├── dataset -> Dataset code to read videos
│ ├── models
│ │ ├── captioner
│ │ ├── super_resolution
│ ├── modules
│ │ ├── ae -> compress videos to latents
│ │ │ ├── vqvae
│ │ │ ├── vae
│ │ ├── diffusion -> denoise latents
│ │ │ ├── diffusion_2d
│ │ │ ├── diffusion_3d
| ├── utils.py
│ ├── train.py -> Training code
The recommended requirements are as follows.
- Python >= 3.8
- Pytorch >= 1.13.1
- CUDA Version >= 11.7
- Install required packages:
git clone https://github.com/PKU-YuanGroup/Open-Sora-Plan
cd Open-Sora-Plan
conda create -n opensora python=3.8 -y
conda activate opensora
pip install torch==1.13.1+cu117 torchvision==0.14.1+cu117 torchaudio==0.13.1 --extra-index-url https://download.pytorch.org/whl/cu117
pip install -r requirements.txt
cd VideoGPT
pip install -e .
cd ..
Refer to data.md
Refer to origin repo. Use the scripts/train_vqvae.py
script to train a Video-VQVAE. Execute python scripts/train_vqvae.py -h
for information on all available training settings. A subset of more relevant settings are listed below, along with default values.
cd VideoGPT
--embedding_dim
: number of dimensions for codebooks embeddings--n_codes 2048
: number of codes in the codebook--n_hiddens 240
: number of hidden features in the residual blocks--n_res_layers 4
: number of residual blocks--downsample 4 4 4
: T H W downsampling stride of the encoder
--gpus 2
: number of gpus for distributed training--sync_batchnorm
: usesSyncBatchNorm
instead ofBatchNorm3d
when using > 1 gpu--gradient_clip_val 1
: gradient clipping threshold for training--batch_size 16
: batch size per gpu--num_workers 8
: number of workers for each DataLoader
--data_path <path>
: path to anhdf5
file or a folder containingtrain
andtest
folders with subdirectories of videos--resolution 128
: spatial resolution to train on--sequence_length 16
: temporal resolution, or video clip length
python VideoGPT/rec_video.py --video-path "assets/origin_video_0.mp4" --rec-path "rec_video_0.mp4" --num-frames 500 --sample-rate 1
python VideoGPT/rec_video.py --video-path "assets/origin_video_1.mp4" --rec-path "rec_video_1.mp4" --resolution 196 --num-frames 600 --sample-rate 1
We present four reconstructed videos in this demonstration, arranged from left to right as follows:
3s 596x336 | 10s 256x256 | 18s 196x196 | 24s 168x96 |
---|---|---|---|
sh scripts/train.sh
Coming soon.
- DiT: Scalable Diffusion Models with Transformers.
- VideoGPT: Video Generation using VQ-VAE and Transformers.
- FiT: Flexible Vision Transformer for Diffusion Model.
- Positional Interpolation: Extending Context Window of Large Language Models via Positional Interpolation.
- The service is a research preview intended for non-commercial use only. See LICENSE.txt for details.