Open-Sora Plan

This project aims to provide a simple and scalable repository for reproducing Sora (from OpenAI, though we prefer to call it "ClosedAI"). We hope the open-source community will contribute to this project. Pull requests are welcome! The current code supports complete training and inference on the Huawei Ascend AI computing system, and models trained on Huawei Ascend can produce video quality comparable to industry standards.

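This project was jointly initiated by the Peking University-Tuzhan AIGC Joint Lab and aims to reproduce Sora through the power of the open-source community. The current version is still far from that goal and needs continuous improvement and rapid iteration. Pull requests are welcome! The code also supports complete training and inference on a domestic AI computing system (Huawei Ascend), and models trained on Ascend can produce video quality on par with the industry.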

If you like our project, please give us a star ⭐ on GitHub to get the latest updates.

📣 News

[2024.08.13] 🎉 We are launching the Open-Sora Plan v1.2.0 I2V model, which is based on Open-Sora Plan v1.2.0. The current version supports image-to-video generation and transition generation (video generation conditioned on the starting and ending frames). Check out the Image-to-Video section in this report.
[2024.07.24] 🔥🔥🔥 v1.2.0 is here! It uses a 3D full attention architecture instead of 2+1D. We released a true 3D video diffusion model trained on 4s 720p videos. Check out our latest report.
[2024.05.27] 🎉 We are launching Open-Sora Plan v1.1.0, which significantly improves video quality and length, and is fully open source! Please check out our latest report. Thanks to ShareGPT4Video's capability to annotate long videos.
[2024.04.09] 🤝 Excited to share our latest exploration on metamorphic time-lapse video generation: MagicTime, which learns real-world physics knowledge from time-lapse videos.
[2024.04.07] 🎉🎉🎉 Today, we are thrilled to present Open-Sora-Plan v1.0.0, which significantly enhances video generation quality and text control capabilities. See our report. Thanks to HUAWEI NPU for supporting us.
[2024.03.27] 🚀🚀🚀 We release the report of VideoCausalVAE, which supports both images and videos. We present our reconstructed video in this demonstration as follows. The text-to-video model is on the way.
[2024.03.01] 🤗 We launched a plan to reproduce Sora, called Open-Sora Plan! Welcome to watch 👀 this repository for the latest updates.

😍 Gallery

93×1280×720 Text-to-Video Generation. The video quality has been compressed for playback on GitHub.

video_24fps_compress.mp4

😮 Highlights

Open-Sora Plan shows excellent performance in video generation.

🔥 High-performance CausalVideoVAE with lower training cost

High compression ratio with excellent performance: it compresses videos by 256 times (4×8×8). Causal convolution supports simultaneous inference on images and videos, and training needs only 1 node.
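
As a rough illustration of the 4×8×8 figure, the sketch below computes the latent size for a given input. The function name and the frame rule are assumptions, not the repository's code: we assume the first frame is encoded on its own and each following group of 4 frames becomes one latent frame, which is consistent with the 4k+1 frame counts (29, 93) used elsewhere in this README.

```python
def latent_shape(num_frames, height, width, t_stride=4, s_stride=8):
    """Latent (frames, height, width) under an assumed causal 4x8x8 compression.

    Assumption: the first frame maps to one latent frame and every further
    group of t_stride frames maps to one more, so num_frames = t_stride * k + 1.
    """
    if num_frames < 1 or (num_frames - 1) % t_stride != 0:
        raise ValueError("num_frames should have the form t_stride * k + 1")
    if height % s_stride != 0 or width % s_stride != 0:
        raise ValueError("height and width should be multiples of s_stride")
    latent_frames = (num_frames - 1) // t_stride + 1
    return (latent_frames, height // s_stride, width // s_stride)


# Nominal compression factor quoted above: 4 (time) x 8 (height) x 8 (width).
print(4 * 8 * 8)
print(latent_shape(93, 720, 1280))  # a 93x1280x720 video
print(latent_shape(1, 480, 640))    # a single image is the k = 0 case
```

Under this assumption a single image is simply a one-frame video, which is one way to read the claim that images and videos can share the same model.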

🚀 Video Diffusion Model based on 3D attention, joint learning of spatiotemporal features.

With a 3D full attention architecture instead of a 2+1D model, 3D attention can better capture joint spatial and temporal features.

🤗 Demo

Gradio Web UI

We highly recommend trying out our web demo with the following command.

python -m opensora.serve.gradio_web_server --model_path "path/to/model" --ae_path "path/to/causalvideovae"

ComfyUI

Coming soon...

🐳 Resource

Version	Architecture	Diffusion Model	CausalVideoVAE	Data
v1.2.0	3D	93x720p, 29x720p[1], 93x480p[1,2], 29x480p, 1x480p, 93x480p_i2v	Anysize	Annotations
v1.1.0	2+1D	221x512x512, 65x512x512	Anysize	Data and Annotations
v1.0.0	2+1D	65x512x512, 65x256x256, 17x256x256	Anysize	Data and Annotations

[1] Please note that the weights for v1.2.0 29×720p and 93×480p were trained on Panda70M and have not undergone final high-quality data fine-tuning, so they may produce watermarks.

[2] We fine-tuned the 93×720p weights for 3.5k steps to obtain 93×480p for community research use.

Warning

🚨 For version 1.2.0, we no longer support 2+1D models.

⚙️ Requirements and Installation

Clone this repository and navigate to the Open-Sora-Plan folder

git clone https://github.com/PKU-YuanGroup/Open-Sora-Plan
cd Open-Sora-Plan

Install the required packages. We recommend the following requirements.

Python >= 3.8
PyTorch >= 2.1.0
CUDA Version >= 11.7

conda create -n opensora python=3.8 -y
conda activate opensora
pip install -e .

Install additional packages for training

pip install -e ".[train]"

Install optional requirements such as static type checking:

pip install -e '.[dev]'

🗝️ Training & Validating

🗜️ CausalVideoVAE

Data preparation

Organizing the training data is easy: put all the videos under one directory, at any nesting depth. This makes training more convenient when using multiple datasets.

Training Dataset
|——sub_dataset1
    |——sub_sub_dataset1
        |——video1.mp4
        |——video2.mp4
        ......
    |——sub_sub_dataset2
        |——video3.mp4
        |——video4.mp4
        ......
|——sub_dataset2
    |——video5.mp4
    |——video6.mp4
    ......
|——video7.mp4
|——video8.mp4
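
The recursive layout above can be scanned with a few lines of standard-library Python. This is only a sketch to show the idea, not the repository's data loader; `collect_videos` and `VIDEO_EXTENSIONS` are hypothetical names, and restricting the scan to `.mp4` is an assumption based on the example tree.

```python
import tempfile
from pathlib import Path

VIDEO_EXTENSIONS = {".mp4"}  # assumption: extend if your data uses other containers


def collect_videos(root):
    """Return every video file below root, at any nesting depth, in sorted order."""
    return sorted(
        path
        for path in Path(root).rglob("*")
        if path.is_file() and path.suffix.lower() in VIDEO_EXTENSIONS
    )


# Build a tiny throwaway tree that mirrors the layout above, then scan it.
with tempfile.TemporaryDirectory() as tmp:
    for name in (
        "sub_dataset1/sub_sub_dataset1/video1.mp4",
        "sub_dataset2/video5.mp4",
        "video7.mp4",
        "notes.txt",  # non-video files are ignored
    ):
        target = Path(tmp, name)
        target.parent.mkdir(parents=True, exist_ok=True)
        target.touch()
    found = [path.relative_to(tmp).as_posix() for path in collect_videos(tmp)]

print(found)
```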

Training

bash scripts/causalvae/train.sh

We introduce the important args for training.

Argparse	Usage
Training size
`--num_frames`	The number of frames sampled from each training video
`--resolution`	The resolution of the input to the VAE
`--batch_size`	The local batch size on each GPU
`--sample_rate`	The frame interval used when loading training videos
Data processing
`--video_path`	/path/to/dataset
Load weights
`--model_config`	/path/to/config.json. The model config of the VAE. Use this parameter if you want to train from scratch.
`--pretrained_model_name_or_path`	A directory containing a model checkpoint and its config. This parameter loads only the weights, not the optimizer state.
`--resume_from_checkpoint`	/path/to/checkpoint. Resumes training from the checkpoint, including the weights and the optimizer state.
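
To make the interplay of `--num_frames` and `--sample_rate` concrete, here is a minimal sketch of how a clip's frame indices could be chosen. It assumes that `--sample_rate` is the stride between consecutive sampled frames; `sample_frame_indices` and its `start` argument are hypothetical and not taken from the training script.

```python
def sample_frame_indices(total_frames, num_frames, sample_rate, start=0):
    """Pick num_frames indices from a video, taking every sample_rate-th frame."""
    if num_frames < 1 or sample_rate < 1:
        raise ValueError("num_frames and sample_rate must be positive")
    # Number of source frames covered by the clip, from first to last index.
    span = (num_frames - 1) * sample_rate + 1
    if start < 0 or start + span > total_frames:
        raise ValueError("the video is too short for this clip")
    return list(range(start, start + span, sample_rate))


# 5 frames with an interval of 3, from a 100-frame video.
print(sample_frame_indices(total_frames=100, num_frames=5, sample_rate=3))
```

With a larger interval the same number of frames covers a longer stretch of the source video, so motion between sampled frames is larger.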

Inference

bash scripts/causalvae/rec_video.sh

We introduce the important args for inference.

Argparse	Usage
Output video size
`--num_frames`	The number of frames of generated videos
`--height`	The height of generated videos
`--width`	The width of generated videos
Data processing
`--video_path`	The path to the original video
`--rec_path`	The path to the generated video
Load weights
`--ae_path`	/path/to/model_dir. A directory containing the VAE checkpoint used for inference and its model config.json
Other
`--enable_tiling`	Use tiling to handle videos of high resolution and long duration
`--save_memory`	Reduce memory use during inference, at a slight cost in quality

Evaluation

For evaluation, you should save the original video clips by using --output_origin.

bash scripts/causalvae/prepare_eval.sh

We introduce the important args for inference.

Argparse	Usage
Output video size
`--num_frames`	The number of frames of generated videos
`--resolution`	The resolution of generated videos
Data processing
`--real_video_dir`	The directory of the original videos.
`--generated_video_dir`	The directory of the generated videos.
Load weights
`--ckpt`	/path/to/model_dir. A directory containing the VAE checkpoint used for inference and its model config.
Other
`--enable_tiling`	Use tiling to handle videos of high resolution and long duration.
`--output_origin`	Output the original video clips, fed into the VAE.

Then we run the evaluation. The important arguments in the script are as follows.

bash scripts/causalvae/eval.sh

Argparse	Usage
`--metric`	The metric, such as psnr, ssim, lpips
`--real_video_dir`	The directory of the original videos.
`--generated_video_dir`	The directory of the generated videos.
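
For readers unfamiliar with the metrics, the sketch below shows the standard PSNR definition, 10 · log10(MAX² / MSE), on flattened pixel values. It is a plain-Python illustration of the formula only, not the implementation used by eval.sh, and the `psnr` helper is a hypothetical name.

```python
import math


def psnr(original, reconstructed, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized pixel sequences."""
    if len(original) != len(reconstructed) or len(original) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    # Mean squared error over all pixel values.
    mse = sum((a - b) ** 2 for a, b in zip(original, reconstructed)) / len(original)
    if mse == 0:
        return math.inf  # identical inputs
    return 10.0 * math.log10(max_value ** 2 / mse)


# Four 8-bit pixel values from an original frame and its reconstruction.
original_pixels = [0, 128, 255, 64]
reconstructed_pixels = [2, 126, 255, 66]
print(round(psnr(original_pixels, reconstructed_pixels), 2))
```

Higher PSNR means the reconstruction is closer to the original; identical inputs give infinity.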

📜 Text-to-Video

Data preparation

We use a data.txt file to specify all the training data. Each line in the file consists of DATA_ROOT and DATA_JSON. The example of data.txt is as follows.

/path/to/data_root_1,/path/to/data_json_1.json
/path/to/data_root_2,/path/to/data_json_2.json
...

Then, we introduce the format of the annotation json file. The absolute data path is the concatenation of DATA_ROOT and the "path" field in the annotation json file.

For image

The format of the image annotation file is as follows.

[
  {
    "path": "00168/001680102.jpg",
    "cap": [
      "xxxxx."
    ],
    "resolution": {
      "height": 512,
      "width": 683
    }
  },
  ...
]

For video

The format of the video annotation file is as follows. For more details, refer to the HF dataset.

[
  {
    "path": "panda70m_part_5565/qLqjjDhhD5Q/qLqjjDhhD5Q_segment_0.mp4",
    "cap": [
      "A man and a woman are sitting down on a news anchor talking to each other."
    ],
    "resolution": {
      "height": 720,
      "width": 1280
    },
    "fps": 29.97002997002997,
    "duration": 11.444767
  },
  ...
]
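
The relationship between data.txt, DATA_ROOT and the "path" field can be sketched as follows. This is an illustration under stated assumptions rather than the repository's loader: `parse_data_txt` and `resolve_paths` are hypothetical helpers, and each data.txt line is assumed to hold exactly one comma between DATA_ROOT and DATA_JSON.

```python
import json
import os


def parse_data_txt(text):
    """Split each non-empty line of data.txt into (DATA_ROOT, DATA_JSON)."""
    pairs = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        data_root, data_json = line.split(",", 1)
        pairs.append((data_root.strip(), data_json.strip()))
    return pairs


def resolve_paths(data_root, annotations):
    """Join DATA_ROOT with the "path" field of every annotation entry."""
    return [os.path.join(data_root, item["path"]) for item in annotations]


data_txt = "/path/to/data_root_1,/path/to/data_json_1.json\n"
# In practice the annotations would be read with json.load from DATA_JSON.
annotations = json.loads(
    '[{"path": "00168/001680102.jpg", "cap": ["xxxxx."],'
    ' "resolution": {"height": 512, "width": 683}}]'
)

for data_root, data_json in parse_data_txt(data_txt):
    print(data_json, resolve_paths(data_root, annotations))
```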

Training

bash scripts/text_condition/gpu/train_t2v.sh

We introduce some key parameters in order to customize your training process.

Argparse	Usage
Training size
`--num_frames 61`	To train videos of different durations, e.g., 29, 61, 93, 125...
`--max_height 640`	To train videos of different resolutions
`--max_width 480`	To train videos of different resolutions
Data processing
`--data /path/to/data.txt`	Specify your training data.
`--speed_factor 1.25`	Speed up videos by 1.25x.
`--drop_short_ratio 1.0`	If you do not want to train on videos of dynamic durations, set this to discard all video data whose frame count is not equal to `--num_frames`.
`--group_frame`	If you want to train with videos of dynamic durations, we highly recommend specifying `--group_frame` as well. It improves computational efficiency during training.
Multi-stage transfer learning
`--interpolation_scale_h 1.0`	Suppose you trained a base model at 240p (`--max_height 240`, `--interpolation_scale_h 1.0`) and want to initialize a higher-resolution model such as 480p (height 480) from the 240p weights. In that case, set `--max_height 480` and `--interpolation_scale_h 2.0`, and set `--pretrained` to your 240p weights path (path/to/240p/xxx.safetensors).
`--interpolation_scale_w 1.0`	Same as `--interpolation_scale_h 1.0`
Load weights
`--pretrained`	This is typically used for loading pretrained weights across stages, such as using 240p weights to initialize 480p training, or when switching datasets without keeping the previous optimizer state.
`--resume_from_checkpoint`	It will resume the training process from the latest checkpoint in `--output_dir`. Typically, we set `--resume_from_checkpoint="latest"`, which is useful in cases of unexpected interruptions during training.
Sequence Parallelism
`--sp_size 8 --train_sp_batch_size 2`	It means running a batch size of 2 across 8 GPUs (8 GPUs on the same node).
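
The multi-stage example in the table (240p with scale 1.0, then 480p with scale 2.0) suggests that the interpolation scale grows in proportion to the resolution. The sketch below encodes that reading; the linear rule and the `interpolation_scale` and `stage_flags` helpers are assumptions for illustration, so check them against the training script before relying on other resolutions.

```python
def interpolation_scale(target_size, base_size=240, base_scale=1.0):
    """Scale for a new stage, assuming it grows linearly with resolution."""
    if target_size <= 0 or base_size <= 0:
        raise ValueError("sizes must be positive")
    return base_scale * target_size / base_size


def stage_flags(max_height, base_height=240):
    """Format the height-related flags for a stage initialized from a base stage."""
    scale = interpolation_scale(max_height, base_height)
    return f"--max_height {max_height} --interpolation_scale_h {scale}"


print(stage_flags(240))  # the base stage itself
print(stage_flags(480))  # the 480p stage from the table
```

The same calculation would apply to `--max_width` and `--interpolation_scale_w`, with the base width in place of the base height.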

Warning

🚨 We have two ways to load weights: `--pretrained` and `--resume_from_checkpoint`. The latter will override the former.

Inference

We provide multiple inference scripts to support various requirements. We recommend the configuration --guidance_scale 7.5 --num_sampling_steps 100 --sample_method EulerAncestralDiscrete for sampling.

We report inference speed on H100 for 29×720p and 93×720p.

Size	1 GPU	8 GPUs (sp)
29×720p	420s/100step	80s/100step
93×720p	3400s/100step	450s/100step
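
The speedup from sequence parallelism can be read off the table directly. The short calculation below uses only the timings reported above; "parallel efficiency" here means the speedup divided by the number of GPUs.

```python
# Seconds per 100 sampling steps on H100, copied from the table above.
timings = {
    "29x720p": {"1_gpu": 420, "8_gpus_sp": 80},
    "93x720p": {"1_gpu": 3400, "8_gpus_sp": 450},
}

results = {}
for size, t in timings.items():
    speedup = t["1_gpu"] / t["8_gpus_sp"]
    efficiency = speedup / 8  # fraction of the ideal 8x speedup
    results[size] = (speedup, efficiency)
    print(f"{size}: {speedup:.2f}x speedup, {efficiency:.0%} parallel efficiency")
```

The longer 93-frame sequence benefits more from sequence parallelism than the 29-frame one.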

🖥️ 1 GPU

If you only have one GPU, it will perform inference on each sample sequentially, one at a time.

bash scripts/text_condition/gpu/sample_t2v.sh

🖥️🖥️ Multi-GPUs

If you want to batch infer a large number of samples, each GPU will infer one sample.

bash scripts/text_condition/gpu/sample_t2v_ddp.sh

🖥️🖥️ Multi-GPUs & Sequence Parallelism

If you want to quickly infer one sample, it will utilize all GPUs simultaneously to infer that sample.

bash scripts/text_condition/gpu/sample_t2v_sp.sh

🖼️ Image-to-Video

Data preparation

Same as Text-to-Video.

Training

bash scripts/text_condition/gpu/train_inpaint.sh

In addition to the parameters shared with the Text-to-Video mode, there are some unique parameters specific to the Image-to-Video mode that you need to be aware of.

Argparse	Usage
Training size
`--use_vae_preprocessed_mask`	Whether to use VAE (Variational Autoencoder) to encode the mask in order to achieve frame-level mask alignment.
Data processing
`--i2v_ratio 0.5`	The proportion of training data allocated to executing the Image-to-Video task.
`--transition_ratio 0.4`	The proportion of training data allocated to executing the transition task.
`--v2v_ratio 0.1`	The proportion of training data allocated to executing the video continuation task.
`--default_text_ratio 0.5`	When training with CFG (Classifier-Free Guidance) enabled, a portion of the text is replaced with default text, while another portion is set to an empty string.
Load weights
`--pretrained_transformer_model_path`	This parameter functions the same as the `--pretrained` parameter.
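
One way to picture the three task ratios is as probabilities for assigning each training sample to a task. The sketch below makes that assumption explicit; it is not the repository's implementation, and `TASK_RATIOS` and `choose_task` are hypothetical names. The values are the defaults shown in the table.

```python
import random
from collections import Counter

# Values from the table above; note that they sum to 1.0.
TASK_RATIOS = {"i2v": 0.5, "transition": 0.4, "v2v": 0.1}


def choose_task(rng):
    """Draw one task name with probability proportional to its ratio (assumed behavior)."""
    tasks = list(TASK_RATIOS)
    weights = [TASK_RATIOS[task] for task in tasks]
    return rng.choices(tasks, weights=weights, k=1)[0]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
counts = Counter(choose_task(rng) for _ in range(10_000))
for task, ratio in TASK_RATIOS.items():
    print(f"{task}: target {ratio:.0%}, drawn {counts[task] / 10_000:.1%}")
```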

Inference

In the current version, we have only open-sourced the 480p version of the Image-to-Video (I2V) model. We recommend the configuration --guidance_scale 7.5 --num_sampling_steps 100 --sample_method PNDM for sampling. Please note that, because of the added frame-controllable fine-tuning, other samplers may not yield satisfactory results.

We report inference speed on H100 for 93×480p.

Size	1 GPU	8 GPUs (sp)
93×480p	427s/100step	81s/100step

Before inference, you need to create two text files: one named prompt.txt and another named conditional_images_path.txt. Each line in prompt.txt should correspond to the path or paths on the same line of conditional_images_path.txt.

For example, if the content of prompt.txt is:

this is a prompt of i2v task.
this is a prompt of transition task.

Then the content of conditional_images_path.txt should be:

/path/to/image_0.png
/path/to/image_1_0.png,/path/to/image_1_1.png

This means we will execute an image-to-video task using /path/to/image_0.png and "this is a prompt of i2v task." For the transition task, we'll use /path/to/image_1_0.png and /path/to/image_1_1.png (note that these two paths are separated by a comma without any spaces) along with "this is a prompt of transition task."

After creating the files, make sure to specify their paths in the sample_inpaint.sh script.
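
Before running inference, it can help to check that the two files line up. The sketch below pairs each prompt with its image paths and labels the task by the number of paths (one for image-to-video, two for transition), following the convention described above. `pair_prompts_with_images` is a hypothetical helper, not part of the repository.

```python
def pair_prompts_with_images(prompt_text, image_text):
    """Pair line i of prompt.txt with line i of conditional_images_path.txt."""
    prompts = [line.strip() for line in prompt_text.splitlines() if line.strip()]
    image_lines = [line.strip() for line in image_text.splitlines() if line.strip()]
    if len(prompts) != len(image_lines):
        raise ValueError("both files must have the same number of lines")

    pairs = []
    for prompt, line in zip(prompts, image_lines):
        paths = line.split(",")  # comma-separated, without spaces
        if len(paths) == 1:
            task = "i2v"
        elif len(paths) == 2:
            task = "transition"
        else:
            raise ValueError(f"expected 1 or 2 image paths, got {len(paths)}")
        pairs.append((task, prompt, paths))
    return pairs


prompt_txt = "this is a prompt of i2v task.\nthis is a prompt of transition task.\n"
images_txt = "/path/to/image_0.png\n/path/to/image_1_0.png,/path/to/image_1_1.png\n"

for task, prompt, paths in pair_prompts_with_images(prompt_txt, images_txt):
    print(task, paths, prompt)
```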

🖥️ 1 GPU

If you only have one GPU, it will perform inference on each sample sequentially, one at a time.

bash scripts/text_condition/gpu/sample_inpaint.sh

🖥️🖥️ Multi-GPUs

If you want to batch infer a large number of samples, each GPU will infer one sample.

bash scripts/text_condition/gpu/sample_inpaint_ddp.sh

🖥️🖥️ Multi-GPUs & Sequence Parallelism

If you want to quickly infer one sample, it will utilize all GPUs simultaneously to infer that sample.

bash scripts/text_condition/gpu/sample_inpaint_sp.sh

💡 How to Contribute

We greatly appreciate your contributions to the Open-Sora Plan open-source community and your help in making it even better!

For more details, please refer to the Contribution Guidelines.

👍 Acknowledgement

Latte: A wonderful 2+1D video generation model.
PixArt-alpha: Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis.
ShareGPT4Video: Improving Video Understanding and Generation with Better Captions.
VideoGPT: Video Generation using VQ-VAE and Transformers.
DiT: Scalable Diffusion Models with Transformers.
FiT: Flexible Vision Transformer for Diffusion Model.
Positional Interpolation: Extending Context Window of Large Language Models via Positional Interpolation.

🔒 License

See LICENSE for details.

✏️ Citing

BibTeX

@software{pku_yuan_lab_and_tuzhan_ai_etc_2024_10948109,
  author       = {PKU-Yuan Lab and Tuzhan AI etc.},
  title        = {Open-Sora-Plan},
  month        = apr,
  year         = 2024,
  publisher    = {GitHub},
  doi          = {10.5281/zenodo.10948109},
  url          = {https://doi.org/10.5281/zenodo.10948109}
}
