This project aims to create a simple and scalable repo to reproduce Sora (OpenAI, but we prefer to call it "ClosedAI") and to build knowledge about Video-VQVAE (VideoGPT) + DiT at scale. However, since our resources are limited, we sincerely hope the whole open-source community will contribute to this project. Pull requests are welcome!!!
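This project hopes to reproduce Sora through the power of the open-source community. It was jointly initiated by the PKU-Tuzhan (北大-兔展) AIGC Joint Lab. With limited resources, we have so far only built the basic architecture and cannot run a full training. We hope the open-source community will help us add modules step by step and gather resources for training. The current version is still far from the goal and needs continued improvement and rapid iteration. Pull requests are welcome!!!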
Project stages:
- Primary
- Set up the codebase and train an unconditional model on a landscape dataset.
- Train models that boost resolution and duration.
- Extensions
- Conduct text2video experiments on a landscape dataset.
- Train the 1080p model on a video2text dataset.
- Control model with more conditions.
[2024.03.10] 🚀🚀🚀 This repo supports training a latent size of 225×90×90 (t×h×w), which means we are able to train 1 minute of 1080P video with 30FPS (2× interpolated frames and 2× super resolution) under class-condition.
[2024.03.08] We support text-conditioned training with 16 frames of 512x512. The code is mainly borrowed from Latte.
[2024.03.07] We support training with 128 frames (when sample rate = 3, which is about 13 seconds) of 256x256, or 64 frames (which is about 6 seconds) of 512x512.
[2024.03.05] See our latest todo, pull requests are welcome.
[2024.03.04] We reorganized and modularized our code to make it easier to contribute to the project. To contribute, please see the Repo structure.
[2024.03.03] We opened some discussions to clarify several issues.
[2024.03.01] Training code is available now! Learn more on our project page. Please feel free to watch 👀 this repository for the latest updates.
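The clip lengths quoted in the 2024.03.07 entry follow from simple arithmetic. Below is a minimal sketch, assuming the source videos run at 30 FPS and that the sample rate of 3 applies to both settings (the entry states neither explicitly). `clip_seconds` is a hypothetical helper, not a function in this repo.

```python
def clip_seconds(num_frames: int, sample_rate: int, fps: float = 30.0) -> float:
    """Seconds of source video covered by a sampled clip.

    Keeping every `sample_rate`-th frame, `num_frames` sampled frames
    span roughly num_frames * sample_rate source frames.
    The default of 30 FPS is an assumption, not a value from the repo.
    """
    return num_frames * sample_rate / fps


if __name__ == "__main__":
    for frames in (128, 64):
        seconds = clip_seconds(frames, sample_rate=3)
        print(f"{frames} frames at sample rate 3: {seconds:.1f} s")
```

Under these assumptions, 128 frames cover 12.8 seconds and 64 frames cover 6.4 seconds, which matches the "about 13 seconds" and "about 6 seconds" figures above.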
- Fix typos & Update readme. 🤝 Thanks to @mio2333, @CreamyLong, @chg0901, @Nyx-177, @HowardLi1984, @sennnnn, @Jason-fan20
- Set up environment. 🤝 Thanks to @nameless1117
- Add docker file. ⌛ [WIP] 🤝 Thanks to @Mon-ius, @SimonLeeGit
- Enable type hints for functions. 🤝 Thanks to @RuslanPeresy, 🙏 [Need your contribution]
- Resume from checkpoint.
- Add Video-VQGAN model, which is borrowed from VideoGPT.
- Support training on DiT with variable aspect ratios, resolutions, and durations.
- Support Dynamic mask input inspired by FiT.
- Add class-conditioning on embeddings.
- Incorporate Latte as the main codebase.
- Add VAE model, which is borrowed from Stable Diffusion.
- Joint dynamic mask input with VAE.
- Add VQVAE from VQGAN. 🙏 [Need your contribution]
- Make the codebase ready for the cluster training. Add SLURM scripts. 🙏 [Need your contribution]
- Refactor VideoGPT. 🤝 Thanks to @qqingzheng, @luo3300612, @sennnnn
- Add sampling script.
- Add DDP sampling script. ⌛ [WIP]
- Use accelerate on multi-node. 🤝 Thanks to @sysuyy
- Incorporate SiT. 🤝 Thanks to @khan-yin
- Add evaluation scripts (FVD, CLIP score). 🤝 Thanks to @rain305f
- Add PI to support out-of-domain size. 🤝 Thanks to @jpthu17
- Add 2D RoPE to improve generalization ability as FiT. 🤝 Thanks to @jpthu17
- Compress KV according to PixArt-sigma.
- Support deepspeed for videogpt training. 🤝 Thanks to @sennnnn
- Train a low-dimensional Video-AE, whether it is a VAE or a VQVAE. ⌛ [WIP] 🚀 [Require more computation]
- Extract offline features.
- Train with offline features.
- Add frame interpolation model. 🤝 Thanks to @yunyangge
- Add super resolution model. 🤝 Thanks to @Linzy19
- Add accelerate to automatically manage training.
- Joint training with images. 🙏 [Need your contribution]
- Implement MaskDiT technique for fast training. 🙏 [Need your contribution]
- Incorporate NaViT. 🙏 [Need your contribution]
- Add FreeNoise support for training-free longer video generation. 🙏 [Need your contribution]
- Implement PeRFlow for improving the sampling process. 🙏 [Need your contribution]
- Finish data loading, pre-processing utils.
- Add T5 support.
- Add CLIP support. 🤝 Thanks to @Ytimed2020
- Add text2image training script.
- Add prompt captioner.
- Collect training data.
- Need video-text pairs with caption. 🙏 [Need your contribution]
- Extract multi-frame descriptions by large image-language models. 🤝 Thanks to @HowardLi1984
- Extract video description by large video-language models. 🙏 [Need your contribution]
- Integrate captions to get a dense caption by using a large language model, such as GPT-4. 🤝 Thanks to @HowardLi1984
- Train a captioner to refine captions. 🚀 [Require more computation]
- Collect training data.
- Look for a suitable dataset; discussion and recommendations are welcome. 🙏 [Need your contribution]
- Add synthetic video created by game engines or 3D representations. 🙏 [Need your contribution]
- Finish data loading, and pre-processing utils. ⌛ [WIP]
- Support memory-friendly training.
- Add flash-attention2 from pytorch.
- Add xformers. 🤝 Thanks to @jialin-zhao
- Support mixed precision training.
- Add gradient checkpointing.
- Support for ReBased and Ring attention. 🤝 Thanks to @kabachuha
- Train using the deepspeed engine. 🤝 Thanks to @sennnnn
- Integrate with Colossal-AI for cheaper, faster, and more efficient training. 🙏 [Need your contribution]
- Train with a text condition. Here we could conduct different experiments: 🚀 [Require more computation]
- Train with T5 conditioning.
- Train with CLIP conditioning.
- Train with CLIP + T5 conditioning (probably costly during training and experiments).
- Load pretrained weights from Latte.
- Incorporate ControlNet. 🙏 [Need your contribution]
├── README.md
├── docs
│   ├── Data.md                    -> Datasets description.
│   └── Contribution_Guidelines.md -> Contribution guidelines description.
├── scripts                        -> All scripts.
├── opensora
│   ├── dataset
│   ├── models
│   │   ├── ae                     -> Compress videos to latents
│   │   │   ├── imagebase
│   │   │   │   ├── vae
│   │   │   │   └── vqvae
│   │   │   └── videobase
│   │   │       ├── vae
│   │   │       └── vqvae
│   │   ├── captioner
│   │   ├── diffusion              -> Denoise latents
│   │   │   ├── diffusion
│   │   │   ├── dit
│   │   │   ├── latte
│   │   │   └── unet
│   │   ├── frame_interpolation
│   │   ├── super_resolution
│   │   └── text_encoder
│   ├── sample
│   ├── train                      -> Training code
│   └── utils
- Clone this repository and navigate to Open-Sora-Plan folder
git clone https://github.com/PKU-YuanGroup/Open-Sora-Plan
cd Open-Sora-Plan
- Install required packages
conda create -n opensora python=3.8 -y
conda activate opensora
pip install -e .
- Install additional packages for training cases
pip install -e ".[train]"
pip install flash-attn --no-build-isolation
- Install optional requirements such as static type checking:
pip install -e '.[dev]'
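After installing, a quick sanity check can confirm that the environment matches the steps above. This is a minimal sketch using only the standard library. The import names `opensora` (taken from the repo structure) and `flash_attn` (the import name of the `flash-attn` package) are assumptions about what the installs expose, and `check_environment` is a hypothetical helper, not part of this repo.

```python
import importlib.util
import sys


def check_environment(modules=("opensora", "flash_attn")):
    """Report whether Python is >= 3.8 and which top-level modules are importable."""
    report = {"python>=3.8": sys.version_info >= (3, 8)}
    for name in modules:
        # find_spec returns None when a top-level module cannot be located.
        report[name] = importlib.util.find_spec(name) is not None
    return report


if __name__ == "__main__":
    for item, ok in check_environment().items():
        print(f"{item}: {'ok' if ok else 'missing'}")
```

A "missing" entry for `flash_attn` only matters for the training setup, since that package is installed in the optional training step.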
Refer to Data.md
Refer to the document EVAL.md.
To train VQVAE, run the script:
scripts/videogpt/train_videogpt.sh
You can modify the training parameters within the script. For training parameters, please refer to transformers.TrainingArguments. Other parameters are explained as follows:
- --embedding_dim: number of dimensions for codebook embeddings
- --n_codes 2048: number of codes in the codebook
- --n_hiddens 240: number of hidden features in the residual blocks
- --n_res_layers 4: number of residual blocks
- --downsample "4,4,4": T H W downsampling stride of the encoder
- --data_path <path>: path to an hdf5 file or a folder containing train and test folders with subdirectories of videos
- --resolution 128: spatial resolution to train on
- --sequence_length 16: temporal resolution, or video clip length
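To see how the downsample, resolution, and sequence_length settings interact, the sketch below computes the size of the latent grid the encoder would produce. It assumes square frames and that each axis shrinks by exactly its stride, which is an assumption about the encoder rather than something this README states. `latent_shape` is a hypothetical helper, not part of the training script.

```python
from typing import Tuple


def latent_shape(sequence_length: int, resolution: int,
                 downsample: str = "4,4,4") -> Tuple[int, int, int]:
    """Latent grid size (T, H, W) for a clip, given "T,H,W" strides."""
    strides = tuple(int(s) for s in downsample.split(","))
    if len(strides) != 3:
        raise ValueError("downsample must have three comma-separated values")
    # Square frames assumed: height and width both equal `resolution`.
    sizes = (sequence_length, resolution, resolution)
    for size, stride in zip(sizes, strides):
        if size % stride != 0:
            raise ValueError(f"{size} is not divisible by stride {stride}")
    return tuple(size // stride for size, stride in zip(sizes, strides))


if __name__ == "__main__":
    # Example values from the parameter list above.
    print(latent_shape(sequence_length=16, resolution=128, downsample="4,4,4"))
```

With the example values listed above (16 frames, resolution 128, strides 4,4,4), this gives a 4×32×32 latent grid under the stated assumptions.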
python examples/rec_video.py --video-path "assets/origin_video_0.mp4" --rec-path "rec_video_0.mp4" --num-frames 500 --sample-rate 1
python examples/rec_video.py --video-path "assets/origin_video_1.mp4" --rec-path "rec_video_1.mp4" --resolution 196 --num-frames 600 --sample-rate 1
We present four reconstructed videos in this demonstration, arranged from left to right as follows:
3s 596x336 | 10s 256x256 | 18s 196x196 | 24s 168x96 |
---|---|---|---|
Please refer to the document VQVAE.
sh scripts/train.sh
Our current resources are only sufficient for preliminary experiments on the Sky dataset.
sh scripts/sample.sh
Below is a visualization of the sampling results.
12s 256x256 | 25s 256x256 |
---|---|
Compared with the original implementation, we add a selection of features that speed up training and save memory, including gradient checkpointing, mixed precision training, pre-extracted features, xformers, and deepspeed. Some data points using a batch size of 1 on an A100:
gradient checkpointing | mixed precision | xformers | feature pre-extraction | deepspeed config | compress kv | training speed | memory |
---|---|---|---|---|---|---|---|
✔ | ✔ | ✔ | ✔ | ❌ | ❌ | 0.64 steps/sec | 43G |
✔ | ✔ | ✔ | ✔ | Zero2 | ❌ | 0.66 steps/sec | 14G |
✔ | ✔ | ✔ | ✔ | Zero2 | ✔ | 0.66 steps/sec | 15G |
✔ | ✔ | ✔ | ✔ | Zero2 offload | ❌ | 0.33 steps/sec | 11G |
✔ | ✔ | ✔ | ✔ | Zero2 offload | ✔ | 0.31 steps/sec | 12G |
gradient checkpointing | mixed precision | xformers | feature pre-extraction | deepspeed config | compress kv | training speed | memory |
---|---|---|---|---|---|---|---|
✔ | ✔ | ✔ | ✔ | ❌ | ❌ | 0.08 steps/sec | 77G |
✔ | ✔ | ✔ | ✔ | Zero2 | ❌ | 0.08 steps/sec | 41G |
✔ | ✔ | ✔ | ✔ | Zero2 | ✔ | 0.09 steps/sec | 36G |
✔ | ✔ | ✔ | ✔ | Zero2 offload | ❌ | 0.07 steps/sec | 39G |
✔ | ✔ | ✔ | ✔ | Zero2 offload | ✔ | 0.07 steps/sec | 33G |
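To make the trade-offs easier to read, the sketch below expresses each row of the first table relative to its baseline row (no deepspeed, no compressed KV). The numbers are copied from that table; `relative_to_baseline` is a hypothetical helper written for illustration only.

```python
# (deepspeed config, compress kv, steps/sec, memory in G), copied from the first table.
ROWS = [
    ("none", False, 0.64, 43),
    ("Zero2", False, 0.66, 14),
    ("Zero2", True, 0.66, 15),
    ("Zero2 offload", False, 0.33, 11),
    ("Zero2 offload", True, 0.31, 12),
]


def relative_to_baseline(rows):
    """Return (config, compress_kv, speed ratio, fraction of memory saved) per row.

    The first row is treated as the baseline.
    """
    _, _, base_speed, base_mem = rows[0]
    return [
        (cfg, kv, speed / base_speed, 1 - mem / base_mem)
        for cfg, kv, speed, mem in rows
    ]


if __name__ == "__main__":
    for cfg, kv, speed_ratio, mem_saved in relative_to_baseline(ROWS):
        print(f"{cfg:14s} kv={kv!s:5s} speed x{speed_ratio:.2f} memory saved {mem_saved:.0%}")
```

Read this way, Zero2 saves roughly two-thirds of the memory at essentially the same speed in the first setting, while Zero2 offload saves a little more memory at about half the speed.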
We greatly appreciate your contributions to the Open-Sora Plan open-source community and your help in making it even better than it is now!
For more details, please refer to the Contribution Guidelines
- Latte: The main codebase we built upon, and a wonderful video generation model.
- VideoGPT: Video Generation using VQ-VAE and Transformers.
- DiT: Scalable Diffusion Models with Transformers.
- FiT: Flexible Vision Transformer for Diffusion Model.
- Positional Interpolation: Extending Context Window of Large Language Models via Positional Interpolation.
- See LICENSE for details.