This project aims to create a simple and scalable repo to reproduce Sora (OpenAI, but we prefer to call it "ClosedAI"). It was jointly initiated by the PKU-Tuzhan AIGC Joint Lab. The current version is still far from the goal and needs continued improvement and rapid iteration, so we hope the open-source community will contribute. Pull requests are welcome! The current code supports complete training and inference on the Huawei Ascend AI computing system, and models trained on Huawei Ascend can produce video quality comparable to industry standards.
- [2024.08.13] 🎉 We are launching the Open-Sora Plan v1.2.0 I2V model, which is based on Open-Sora Plan v1.2.0. The current version supports image-to-video generation and transition generation (video generation conditioned on the starting and ending frames). Check out the Image-to-Video section in this report.
- [2024.07.24] 🔥🔥🔥 v1.2.0 is here! It uses a 3D full attention architecture instead of 2+1D. We released a true 3D video diffusion model trained on 4s 720p videos. Check out our latest report.
- [2024.05.27] 🎉 We are launching Open-Sora Plan v1.1.0, which significantly improves video quality and length, and is fully open source! Please check out our latest report. Thanks to ShareGPT4Video's capability to annotate long videos.
- [2024.04.09] 🤝 Excited to share our latest exploration on metamorphic time-lapse video generation: MagicTime, which learns real-world physics knowledge from time-lapse videos.
- [2024.04.07] 🎉🎉🎉 Today, we are thrilled to present Open-Sora-Plan v1.0.0, which significantly enhances video generation quality and text control capabilities. See our report. Thanks to HUAWEI NPU for supporting us.
- [2024.03.27] 🚀🚀🚀 We release the report of VideoCausalVAE, which supports both images and videos. We present our reconstructed video in this demonstration as follows. The text-to-video model is on the way.
- [2024.03.01] 🤗 We launched a plan to reproduce Sora, called Open-Sora Plan! Welcome to watch 👀 this repository for the latest updates.
93×1280×720 Text-to-Video Generation. The video quality has been compressed for playback on GitHub.
Open-Sora Plan shows excellent performance in video generation.
- High compression ratio with excellent performance: videos are compressed by 256 times (4×8×8). Causal convolution supports simultaneous inference of images and videos, and training needs only 1 node.
- With a 3D full attention architecture instead of a 2+1D model, 3D attention can better capture joint spatial and temporal features.
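To make the 4×8×8 compression figure concrete, the sketch below computes the latent grid size for a given clip. It assumes the temporal axis follows the usual causal-VAE convention, in which the first frame is encoded on its own and the remaining frames are compressed 4×, so `T` frames become `(T - 1) / 4 + 1` latent frames. That formula and the helper name `latent_shape` are assumptions for illustration and are not taken from the repository.

```python
def latent_shape(num_frames, height, width, t_stride=4, s_stride=8):
    """Return (latent_frames, latent_height, latent_width) for a 4x8x8 VAE.

    Assumption: causal temporal compression, i.e. the first frame is kept
    separately and the remaining frames are reduced by `t_stride`.
    """
    if (num_frames - 1) % t_stride != 0:
        raise ValueError("num_frames should have the form t_stride * k + 1")
    if height % s_stride != 0 or width % s_stride != 0:
        raise ValueError("height and width should be multiples of s_stride")
    latent_frames = (num_frames - 1) // t_stride + 1
    return latent_frames, height // s_stride, width // s_stride


# Nominal compression ratio of the 4x8x8 design.
print(4 * 8 * 8)                     # 256
# 93 frames at 1280x720 (width x height).
print(latent_shape(93, 720, 1280))   # (24, 90, 160)
```

Under this assumption a single image (`num_frames=1`) maps to exactly one latent frame, which is consistent with the same VAE handling both images and videos.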
We highly recommend trying our web demo with the following command.
python -m opensora.serve.gradio_web_server --model_path "path/to/model" --ae_path "path/to/causalvideovae"
Coming soon...
Version | Architecture | Diffusion Model | CausalVideoVAE | Data |
---|---|---|---|---|
v1.2.0 | 3D | 93x720p, 29x720p[1], 93x480p[1,2], 29x480p, 1x480p, 93x480p_i2v | Anysize | Annotations |
v1.1.0 | 2+1D | 221x512x512, 65x512x512 | Anysize | Data and Annotations |
v1.0.0 | 2+1D | 65x512x512, 65x256x256, 17x256x256 | Anysize | Data and Annotations |
[1] Please note that the weights for v1.2.0 29×720p and 93×480p were trained on Panda70M and have not undergone final high-quality data fine-tuning, so they may produce watermarks.
[2] We fine-tuned 3.5k steps from 93×720p to get 93×480p for community research use.
Warning
- Clone this repository and navigate to Open-Sora-Plan folder
git clone https://github.com/PKU-YuanGroup/Open-Sora-Plan
cd Open-Sora-Plan
- Install required packages. We recommend the following requirements.
- Python >= 3.8
- PyTorch >= 2.1.0
- CUDA Version >= 11.7
conda create -n opensora python=3.8 -y
conda activate opensora
pip install -e .
- Install additional packages for training cases
pip install -e ".[train]"
- Install optional requirements such as static type checking:
pip install -e '.[dev]'
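After installation, a quick check against the recommended minimum versions can catch environment problems early. The following is a minimal sketch using only the standard library; the helper names are hypothetical, and it checks Python and PyTorch only (it does not inspect the CUDA version).

```python
import re
import sys
from importlib import metadata


def version_tuple(version):
    """Turn '2.1.0+cu118' into (2, 1, 0), ignoring any local suffix."""
    parts = [int(p) for p in re.findall(r"\d+", version.split("+")[0])[:3]]
    return tuple(parts + [0] * (3 - len(parts)))  # pad '3.8' to (3, 8, 0)


def meets_minimum(version, minimum):
    """True if `version` is at least `minimum` (both dotted strings)."""
    return version_tuple(version) >= version_tuple(minimum)


def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


python_version = ".".join(str(n) for n in sys.version_info[:3])
print("Python >= 3.8:", meets_minimum(python_version, "3.8"))

torch_version = installed_version("torch")
if torch_version is None:
    print("PyTorch is not installed")
else:
    print("PyTorch >= 2.1.0:", meets_minimum(torch_version, "2.1.0"))
```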
Organizing the training data is simple: put all the videos under one directory, at any nesting depth. This makes training more convenient when using multiple datasets.
Training Dataset
|——sub_dataset1
    |——sub_sub_dataset1
        |——video1.mp4
        |——video2.mp4
        ......
    |——sub_sub_dataset2
        |——video3.mp4
        |——video4.mp4
        ......
|——sub_dataset2
    |——video5.mp4
    |——video6.mp4
    ......
|——video7.mp4
|——video8.mp4
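Because videos may sit at any depth, collecting them amounts to a recursive directory walk. The sketch below shows one way to do this with `pathlib`; it is an illustration of the layout above, not the repository's data loader, and the `VIDEO_EXTENSIONS` set and `list_videos` name are assumptions.

```python
from pathlib import Path

VIDEO_EXTENSIONS = {".mp4"}  # assumption: extend if your data uses other containers


def list_videos(root):
    """Return every video file under `root`, at any nesting depth, sorted."""
    root = Path(root)
    return sorted(
        path
        for path in root.rglob("*")
        if path.is_file() and path.suffix.lower() in VIDEO_EXTENSIONS
    )
```

Calling `list_videos("/path/to/dataset")` on the tree above would return `video1.mp4` through `video8.mp4` regardless of which sub-directory each one lives in.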
bash scripts/causalvae/train.sh
We introduce the important args for training.
Argparse | Usage |
---|---|
Training size | |
--num_frames | The number of frames sampled from each training video |
--resolution | The resolution of the input to the VAE |
--batch_size | The local batch size on each GPU |
--sample_rate | The frame interval used when loading training videos |
Data processing | |
--video_path | /path/to/dataset |
Load weights | |
--model_config | /path/to/config.json. The model config of the VAE. Use this parameter to train from scratch. |
--pretrained_model_name_or_path | A directory containing a model checkpoint and its config. This parameter loads only the model weights, not the optimizer state. |
--resume_from_checkpoint | /path/to/checkpoint. Resumes training from the checkpoint, including both the weights and the optimizer state. |
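To illustrate how --num_frames and --sample_rate interact, the sketch below picks frame indices at a fixed interval. It assumes that a frame interval of `sample_rate` means taking every `sample_rate`-th frame starting from some offset; the function name and the `start` parameter are hypothetical and not part of the training script.

```python
def sample_frame_indices(total_frames, num_frames, sample_rate, start=0):
    """Pick `num_frames` indices spaced `sample_rate` frames apart.

    Raises ValueError if the video is too short for the requested clip.
    """
    span = (num_frames - 1) * sample_rate + 1  # source frames covered by the clip
    if start < 0 or start + span > total_frames:
        raise ValueError(f"need {span} frames from index {start}, video has {total_frames}")
    return [start + i * sample_rate for i in range(num_frames)]


print(sample_frame_indices(total_frames=100, num_frames=5, sample_rate=3))
# [0, 3, 6, 9, 12]
```

Under this reading, a clip of `num_frames` frames consumes `(num_frames - 1) * sample_rate + 1` source frames, so larger intervals require longer source videos.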
bash scripts/causalvae/rec_video.sh
We introduce the important args for inference.
Argparse | Usage |
---|---|
Output video size | |
--num_frames | The number of frames of generated videos |
--height | The height of generated videos |
--width | The width of generated videos |
Data processing | |
--video_path | The path to the original video |
--rec_path | The path to the generated video |
Load weights | |
--ae_path | /path/to/model_dir. A directory containing the VAE checkpoint used for inference and its model config.json |
Other | |
--enable_tiling | Use tiling to handle videos of high resolution and long duration |
--save_memory | Reduce memory use during inference at a slight cost in quality |
For evaluation, you should save the original video clips by using --output_origin.
bash scripts/causalvae/prepare_eval.sh
We introduce the important args for inference.
Argparse | Usage |
---|---|
Output video size | |
--num_frames | The number of frames of generated videos |
--resolution | The resolution of generated videos |
Data processing | |
--real_video_dir | The directory of the original videos. |
--generated_video_dir | The directory of the generated videos. |
Load weights | |
--ckpt | /path/to/model_dir. A directory containing the VAE checkpoint used for inference and its model config. |
Other | |
--enable_tiling | Use tiling to handle videos of high resolution and long duration. |
--output_origin | Output the original video clips that are fed into the VAE. |
Then we run the evaluation. The important args in the evaluation script are listed below.
bash scripts/causalvae/eval.sh
Argparse | Usage |
---|---|
--metric | The metric, such as psnr, ssim, lpips |
--real_video_dir | The directory of the original videos. |
--generated_video_dir | The directory of the generated videos. |
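As a reminder of what the psnr metric measures, the sketch below applies the standard definition, PSNR = 10 · log10(MAX² / MSE), to two flat lists of pixel values. It is a simplified illustration of the formula, not the evaluation script's implementation; real evaluation operates on whole video tensors.

```python
import math


def psnr(reference, reconstruction, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel sequences."""
    if len(reference) != len(reconstruction) or not reference:
        raise ValueError("inputs must be non-empty and the same length")
    mse = sum((a - b) ** 2 for a, b in zip(reference, reconstruction)) / len(reference)
    if mse == 0:
        return math.inf  # identical inputs
    return 10.0 * math.log10(max_value ** 2 / mse)


original = [52, 55, 61, 66]
rebuilt = [54, 55, 60, 66]
print(round(psnr(original, rebuilt), 2))
```

Higher values mean the reconstruction is closer to the original; identical inputs give infinity.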
We use a data.txt file to specify all the training data. Each line in the file consists of DATA_ROOT and DATA_JSON, separated by a comma. An example data.txt is as follows.
/path/to/data_root_1,/path/to/data_json_1.json
/path/to/data_root_2,/path/to/data_json_2.json
...
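Reading this format back is a matter of splitting each line on the comma. The sketch below is an assumption about how one might parse it, not the repository's loader; the function name is hypothetical, and it would not handle paths that themselves contain commas.

```python
def parse_data_txt(text):
    """Parse the contents of a data.txt file into (data_root, data_json) pairs."""
    pairs = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        line = line.strip()
        if not line:
            continue  # skip blank lines
        parts = [part.strip() for part in line.split(",")]
        if len(parts) != 2 or not all(parts):
            raise ValueError(f"line {line_number}: expected DATA_ROOT,DATA_JSON, got {line!r}")
        pairs.append((parts[0], parts[1]))
    return pairs


example = """/path/to/data_root_1,/path/to/data_json_1.json
/path/to/data_root_2,/path/to/data_json_2.json
"""
for data_root, data_json in parse_data_txt(example):
    print(data_root, data_json)
```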
Next, we describe the format of the annotation json file. The absolute data path is the concatenation of DATA_ROOT and the "path" field in the annotation json file.
The format of the image annotation file is as follows.
[
  {
    "path": "00168/001680102.jpg",
    "cap": [
      "xxxxx."
    ],
    "resolution": {
      "height": 512,
      "width": 683
    }
  },
  ...
]
The format of the video annotation file is as follows. For more details, refer to the HF dataset.
[
  {
    "path": "panda70m_part_5565/qLqjjDhhD5Q/qLqjjDhhD5Q_segment_0.mp4",
    "cap": [
      "A man and a woman are sitting down on a news anchor talking to each other."
    ],
    "resolution": {
      "height": 720,
      "width": 1280
    },
    "fps": 29.97002997002997,
    "duration": 11.444767
  },
  ...
]
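The sketch below shows how a video annotation entry could be combined with DATA_ROOT to obtain the absolute path, as described earlier. It is a minimal illustration rather than the repository's dataset class: the function name is hypothetical, and the approximate frame count (fps × duration) is a derived value that the annotation file does not store.

```python
import json
import os

REQUIRED_FIELDS = {"path", "cap", "resolution", "fps", "duration"}


def resolve_video_entries(data_root, annotation_text):
    """Return (absolute_path, captions, approx_frames) for each video annotation."""
    resolved = []
    for entry in json.loads(annotation_text):
        missing = REQUIRED_FIELDS - set(entry)
        if missing:
            raise KeyError(f"entry {entry.get('path')!r} is missing {sorted(missing)}")
        absolute_path = os.path.join(data_root, entry["path"])
        # Derived estimate; the annotation file itself stores only fps and duration.
        approx_frames = round(entry["fps"] * entry["duration"])
        resolved.append((absolute_path, entry["cap"], approx_frames))
    return resolved


annotation_text = json.dumps([
    {
        "path": "panda70m_part_5565/qLqjjDhhD5Q/qLqjjDhhD5Q_segment_0.mp4",
        "cap": ["A man and a woman are sitting down on a news anchor talking to each other."],
        "resolution": {"height": 720, "width": 1280},
        "fps": 29.97002997002997,
        "duration": 11.444767,
    }
])
for path, captions, frames in resolve_video_entries("/path/to/data_root_1", annotation_text):
    print(path, frames)
```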
bash scripts/text_condition/gpu/train_t2v.sh
We introduce some key parameters in order to customize your training process.
Argparse | Usage |
---|---|
Training size | |
--num_frames 61 | Train on videos of different durations, e.g., 29, 61, 93, 125... |
--max_height 640 | Train on videos of different resolutions |
--max_width 480 | Train on videos of different resolutions |
Data processing | |
--data /path/to/data.txt | Specify your training data. |
--speed_factor 1.25 | Speed up videos by 1.25x. |
--drop_short_ratio 1.0 | If you do not want to train on videos of varying durations, this discards all video data whose frame count is not equal to --num_frames. |
--group_frame | If you want to train on videos of varying durations, we highly recommend also specifying --group_frame. It improves computational efficiency during training. |
Multi-stage transfer learning | |
--interpolation_scale_h 1.0 | Suppose you trained a base model at 240p (--max_height 240, --interpolation_scale_h 1.0) and want to initialize a higher-resolution model such as 480p (height 480) from the 240p weights. Set --max_height 480 and --interpolation_scale_h 2.0, and set --pretrained to the path of your 240p weights (path/to/240p/xxx.safetensors). |
--interpolation_scale_w 1.0 | Same as --interpolation_scale_h 1.0 |
Load weights | |
--pretrained | Typically used to load pretrained weights across stages, such as using 240p weights to initialize 480p training, or when switching datasets without keeping the previous optimizer state. |
--resume_from_checkpoint | Resumes training from the latest checkpoint in --output_dir. Typically, we set --resume_from_checkpoint="latest", which is useful when training is interrupted unexpectedly. |
Sequence Parallelism | |
--sp_size 8 --train_sp_batch_size 2 | Runs a batch size of 2 across 8 GPUs (8 GPUs on the same node). |
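The 240p to 480p example for --interpolation_scale_h suggests that the scale grows in proportion to the resolution. The sketch below encodes that reading; the proportional rule is an assumption drawn from that single example, and the helper name is hypothetical, so check the training scripts before relying on it for other resolutions.

```python
def interpolation_scale(target_size, base_size, base_scale=1.0):
    """Scale for a higher-resolution stage, assuming it is proportional to size."""
    if base_size <= 0 or target_size <= 0:
        raise ValueError("sizes must be positive")
    return base_scale * target_size / base_size


# Base model trained at height 240 with scale 1.0, next stage at height 480.
print(interpolation_scale(480, 240))  # 2.0
```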
Warning
We provide multiple inference scripts to support various requirements. We recommend the configuration --guidance_scale 7.5 --num_sampling_steps 100 --sample_method EulerAncestralDiscrete for sampling.
We report 720p inference speed on H100.
Size | 1 GPU | 8 GPUs (sp) |
---|---|---|
29×720p | 420s/100step | 80s/100step |
93×720p | 3400s/100step | 450s/100step |
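The benefit of sequence parallelism (sp) can be read off the table above by dividing the single-GPU time by the 8-GPU time. The short calculation below does this and also reports parallel efficiency (speedup divided by the number of GPUs); the dictionary keys and function name are chosen here for illustration.

```python
# Seconds per 100 sampling steps on H100, copied from the table above.
timings = {
    "29x720p": {"1_gpu": 420, "8_gpu_sp": 80},
    "93x720p": {"1_gpu": 3400, "8_gpu_sp": 450},
}


def sp_speedup(single_gpu_seconds, multi_gpu_seconds, num_gpus=8):
    """Return (speedup, parallel efficiency) of the multi-GPU run."""
    speedup = single_gpu_seconds / multi_gpu_seconds
    return speedup, speedup / num_gpus


for size, t in timings.items():
    speedup, efficiency = sp_speedup(t["1_gpu"], t["8_gpu_sp"])
    print(f"{size}: {speedup:.2f}x speedup, {efficiency:.0%} efficiency")
# 29x720p: 5.25x speedup, 66% efficiency
# 93x720p: 7.56x speedup, 94% efficiency
```

On these numbers, the longer 93-frame sequence benefits more from sequence parallelism than the 29-frame one.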
If you only have one GPU, it will perform inference on each sample sequentially, one at a time.
bash scripts/text_condition/gpu/sample_t2v.sh
If you want to batch infer a large number of samples, each GPU will infer one sample.
bash scripts/text_condition/gpu/sample_t2v_ddp.sh
If you want to quickly infer one sample, it will utilize all GPUs simultaneously to infer that sample.
bash scripts/text_condition/gpu/sample_t2v_sp.sh
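The difference between the batch mode and the sequence-parallel mode is which axis gets split across GPUs: samples in the former, a single sample's sequence in the latter. For the batch mode, one common way to split samples is a round-robin assignment by process rank, sketched below. This is an assumption about how such a split can be done, not a description of what sample_t2v_ddp.sh does internally; `shard_prompts` is a hypothetical helper.

```python
def shard_prompts(prompts, rank, world_size):
    """Return the prompts handled by process `rank` out of `world_size` processes."""
    if not 0 <= rank < world_size:
        raise ValueError("rank must be in [0, world_size)")
    return prompts[rank::world_size]


prompts = [f"prompt {i}" for i in range(10)]
for rank in range(4):
    print(rank, shard_prompts(prompts, rank, world_size=4))
```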
Same as Text-to-Video.
bash scripts/text_condition/gpu/train_inpaint.sh
In addition to the parameters shared with the Text-to-Video mode, there are some unique parameters specific to the Image-to-Video mode that you need to be aware of.
Argparse | Usage |
---|---|
Training size | |
--use_vae_preprocessed_mask | Whether to use the VAE (Variational Autoencoder) to encode the mask in order to achieve frame-level mask alignment. |
Data processing | |
--i2v_ratio 0.5 | The proportion of training data allocated to the image-to-video task. |
--transition_ratio 0.4 | The proportion of training data allocated to the transition task. |
--v2v_ratio 0.1 | The proportion of training data allocated to the video continuation task. |
--default_text_ratio 0.5 | When training with CFG (Classifier-Free Guidance) enabled, a portion of the text is replaced with default text, while another portion is set to an empty string. |
Load weights | |
--pretrained_transformer_model_path | This parameter functions the same as the --pretrained parameter. |
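The three task ratios in the table sum to 1.0 (0.5 + 0.4 + 0.1), which suggests each training sample is assigned to exactly one task. The sketch below shows one way such a split can be drawn per sample. It is an assumption about the mechanism, not the training code itself, and the task labels and `pick_task` name are hypothetical.

```python
import random

TASK_RATIOS = {"i2v": 0.5, "transition": 0.4, "v2v": 0.1}  # values from the table


def pick_task(rng, ratios=TASK_RATIOS):
    """Draw one task name with probability proportional to its ratio."""
    if abs(sum(ratios.values()) - 1.0) > 1e-6:
        raise ValueError("task ratios should sum to 1")
    names = list(ratios)
    return rng.choices(names, weights=[ratios[n] for n in names], k=1)[0]


rng = random.Random(0)
counts = {name: 0 for name in TASK_RATIOS}
for _ in range(10_000):
    counts[pick_task(rng)] += 1
print(counts)  # roughly 5000 / 4000 / 1000
```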
In the current version, we have only open-sourced the 480p version of the Image-to-Video (I2V) model. We recommend the configuration --guidance_scale 7.5 --num_sampling_steps 100 --sample_method PNDM for sampling. Please note that, because of the added frame-controllable fine-tuning, other samplers may not yield satisfactory results.
We report 93×480p inference speed on H100.
Size | 1 GPU | 8 GPUs (sp) |
---|---|---|
93×480p | 427s/100step | 81s/100step |
Before inference, you need to create two text files: prompt.txt and conditional_images_path.txt. Each line in prompt.txt should correspond to the path(s) on the same line of conditional_images_path.txt.
For example, if the content of prompt.txt is:
this is a prompt of i2v task.
this is a prompt of transition task.
Then the content of conditional_images_path.txt should be:
/path/to/image_0.png
/path/to/image_1_0.png,/path/to/image_1_1.png
This means we will run an image-to-video task using /path/to/image_0.png and "this is a prompt of i2v task." For the transition task, we will use /path/to/image_1_0.png and /path/to/image_1_1.png (note that these two paths are separated by a comma without any spaces) along with "this is a prompt of transition task."
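The pairing rule can be checked before launching inference. The sketch below reads the two files' contents, pairs them line by line, and labels each job as image-to-video (one path) or transition (two comma-separated paths). It is a validation aid written for this example, with hypothetical names, and it only covers the one-path and two-path cases described here.

```python
def pair_prompts_with_images(prompt_text, image_text):
    """Pair each prompt line with its image paths and label the task.

    One path means image-to-video; two comma-separated paths mean transition.
    """
    prompts = [line.strip() for line in prompt_text.splitlines() if line.strip()]
    image_lines = [line.strip() for line in image_text.splitlines() if line.strip()]
    if len(prompts) != len(image_lines):
        raise ValueError("prompt.txt and conditional_images_path.txt need the same number of lines")
    jobs = []
    for prompt, image_line in zip(prompts, image_lines):
        paths = image_line.split(",")
        if any(not p or p != p.strip() for p in paths):
            raise ValueError(f"paths must be comma-separated without spaces: {image_line!r}")
        if len(paths) == 1:
            task = "i2v"
        elif len(paths) == 2:
            task = "transition"
        else:
            raise ValueError(f"expected 1 or 2 image paths, got {len(paths)}: {image_line!r}")
        jobs.append({"task": task, "prompt": prompt, "images": paths})
    return jobs


prompt_text = "this is a prompt of i2v task.\nthis is a prompt of transition task.\n"
image_text = "/path/to/image_0.png\n/path/to/image_1_0.png,/path/to/image_1_1.png\n"
for job in pair_prompts_with_images(prompt_text, image_text):
    print(job["task"], job["images"])
```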
After creating the files, make sure to specify their paths in the sample_inpaint.sh script.
If you only have one GPU, it will perform inference on each sample sequentially, one at a time.
bash scripts/text_condition/gpu/sample_inpaint.sh
If you want to batch infer a large number of samples, each GPU will infer one sample.
bash scripts/text_condition/gpu/sample_inpaint_ddp.sh
If you want to quickly infer one sample, it will utilize all GPUs simultaneously to infer that sample.
bash scripts/text_condition/gpu/sample_inpaint_sp.sh
We greatly appreciate your contributions to the Open-Sora Plan open-source community and helping us make it even better than it is now!
For more details, please refer to the Contribution Guidelines
- Latte: A wonderful 2+1D video generation model.
- PixArt-alpha: Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis.
- ShareGPT4Video: Improving Video Understanding and Generation with Better Captions.
- VideoGPT: Video Generation using VQ-VAE and Transformers.
- DiT: Scalable Diffusion Models with Transformers.
- FiT: Flexible Vision Transformer for Diffusion Model.
- Positional Interpolation: Extending Context Window of Large Language Models via Positional Interpolation.
- See LICENSE for details.
@software{pku_yuan_lab_and_tuzhan_ai_etc_2024_10948109,
author = {PKU-Yuan Lab and Tuzhan AI etc.},
title = {Open-Sora-Plan},
month = apr,
year = 2024,
publisher = {GitHub},
doi = {10.5281/zenodo.10948109},
url = {https://doi.org/10.5281/zenodo.10948109}
}