We present Open-Sora, an initiative dedicated to efficiently producing high-quality video and making the model, tools, and content accessible to all. By embracing open-source principles, Open-Sora not only democratizes access to advanced video generation techniques, but also offers a streamlined and user-friendly platform that simplifies the complexities of video production. With Open-Sora, we aim to inspire innovation, creativity, and inclusivity in the realm of content creation. [Chinese version]
- [2024.03.18] 🔥 We release Open-Sora 1.0, a fully open-source project for video generation. Open-Sora 1.0 supports a full pipeline of video data preprocessing, training with acceleration, inference, and more. The provided checkpoints can produce 2s 512x512 videos with only 3 days of training. [blog]
- [2024.03.04] Open-Sora provides training with 46% cost reduction. [blog]
Videos are downsampled to .gif for display. Click for original videos. Prompts are trimmed for display, see here for full prompts. See more samples at our gallery.
- 📍 Open-Sora-v1 released. Model weights are available here. With only 400K video clips and 200 H800 days (compared with 152M samples in Stable Video Diffusion), we are able to generate 2s 512×512 videos.
- ✅ Three-stage training from an image diffusion model to a video diffusion model. We provide the weights for each stage.
- ✅ Support training acceleration, including an accelerated transformer, faster T5 and VAE, and sequence parallelism. Open-Sora improves training speed by 55% when training on 64x512x512 videos. Details are in acceleration.md.
- ✅ We provide data preprocessing pipeline, including downloading, video cutting, and captioning tools. Our data collection plan can be found at datasets.md.
- ✅ We find that the VQ-VAE from VideoGPT produces low-quality results and thus adopt a better VAE from Stability-AI. We also find that patching in the time dimension degrades quality. See our report for more discussion.
- ✅ We investigate different architectures including DiT, Latte, and our proposed STDiT. Our STDiT achieves a better trade-off between quality and speed. See our report for more discussions.
- ✅ Support CLIP and T5 text conditioning.
- ✅ By viewing images as one-frame videos, our project supports training DiT on both images and videos (e.g., ImageNet & UCF101). See command.md for more instructions.
- ✅ Support inference with official weights from DiT, Latte, and PixArt.
- ✅ Refactor the codebase. See structure.md to learn the project structure and how to use the config files.
- Complete the data processing pipeline (including dense optical flow, aesthetics scores, text-image similarity, deduplication, etc.). See datasets.md for more information. [WIP]
- Training Video-VAE. [WIP]
- Support image and video conditioning.
- Evaluation pipeline.
- Incorporate a better scheduler, e.g., rectified flow in SD3.
- Support variable aspect ratios, resolutions, durations.
- Support SD3 when released.
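The "images as one-frame videos" idea from the feature list above can be illustrated with a small shape helper. This is only a sketch: the `(C, T, H, W)` layout and the helper name `as_video_shape` are assumptions made for illustration, not the project's actual data-loading code.

```python
from typing import Tuple


def as_video_shape(shape: Tuple[int, ...]) -> Tuple[int, ...]:
    """Return a (C, T, H, W) shape, treating a (C, H, W) image as a video with T = 1.

    Hypothetical helper: the channel-first layout is an assumption.
    """
    if len(shape) == 3:  # image: (C, H, W)
        c, h, w = shape
        return (c, 1, h, w)
    if len(shape) == 4:  # video: already (C, T, H, W)
        return tuple(shape)
    raise ValueError(f"expected 3 or 4 dimensions, got {len(shape)}")


# An ImageNet-style image and a 16-frame clip end up with the same rank,
# so a single model code path can consume both.
image_shape = as_video_shape((3, 256, 256))
video_shape = as_video_shape((3, 16, 256, 256))
```

With this convention, an image batch differs from a video batch only in the length of the time axis.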
# create a virtual env and activate it
conda create -n opensora python=3.10
conda activate opensora
# install torch
# the command below is for CUDA 12.1, choose install commands from
# https://pytorch.org/get-started/locally/ based on your own CUDA version
pip3 install torch torchvision
# install flash attention (optional)
pip install packaging ninja
pip install flash-attn --no-build-isolation
# install apex (optional)
pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --config-settings "--build-option=--cpp_ext" --config-settings "--build-option=--cuda_ext" git+https://github.com/NVIDIA/apex.git
# install xformers
pip3 install -U xformers --index-url https://download.pytorch.org/whl/cu121
# install this project
git clone https://github.com/hpcaitech/Open-Sora
cd Open-Sora
pip install -v .
After installation, we suggest reading structure.md to learn the project structure and how to use the config files.
Resolution | Data | #iterations | Batch Size | GPU days (H800) | URL |
---|---|---|---|---|---|
16×256×256 | 366K | 80k | 8×64 | 117 | 🔗 |
16×256×256 | 20K HQ | 24k | 8×64 | 45 | 🔗 |
16×512×512 | 20K HQ | 20k | 2×64 | 35 | 🔗 |
Our model's weights are partially initialized from PixArt-α. The number of parameters is 724M. More information about training can be found in our report. More about the datasets can be found in datasets.md. HQ means high quality.
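As a rough cross-check of the table, the sketch below adds up the training budget. It assumes that a batch size written as "8×64" means a per-GPU batch of 8 on 64 GPUs; that reading is an assumption, not something the table states. Under it, the total comes to 197 H800 GPU days, or about 3 days of wall-clock time on 64 GPUs.

```python
# Rows copied from the table above:
# (stage, iterations, per-GPU batch, number of GPUs, H800 GPU days).
# Reading "8x64" as per-GPU batch x GPU count is an assumption.
STAGES = [
    ("16x256x256", 80_000, 8, 64, 117),
    ("16x256x256 HQ", 24_000, 8, 64, 45),
    ("16x512x512 HQ", 20_000, 2, 64, 35),
]


def global_batch(per_gpu_batch: int, num_gpus: int) -> int:
    """Number of samples processed per optimizer step across all GPUs."""
    return per_gpu_batch * num_gpus


def samples_seen(iterations: int, per_gpu_batch: int, num_gpus: int) -> int:
    """Total samples (with repetition) consumed by one training stage."""
    return iterations * global_batch(per_gpu_batch, num_gpus)


total_gpu_days = sum(days for *_, days in STAGES)
# Wall-clock days if every stage runs on the same 64 GPUs.
wall_clock_days = total_gpu_days / 64

for name, iters, per_gpu, gpus, days in STAGES:
    print(f"{name}: {samples_seen(iters, per_gpu, gpus):,} samples, {days} GPU days")
print(f"total: {total_gpu_days} GPU days, about {wall_clock_days:.1f} days on 64 GPUs")
```

The totals are only as reliable as the batch-size reading above; the report remains the authoritative source for the training setup.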
To run inference with our provided weights, first download T5 weights into pretrained_models/t5_ckpts/t5-v1_1-xxl. Then download the model weights from Hugging Face. Run the following commands to generate samples. To change sampling prompts, modify the txt file passed to --prompt-path. See here to customize the configuration.
# Sample 16x256x256 (5s/sample, 100 time steps, 22 GB memory)
torchrun --standalone --nproc_per_node 1 scripts/inference.py configs/opensora/inference/16x256x256.py --ckpt-path ./path/to/your/ckpt.pth --prompt-path ./assets/texts/t2v_samples.txt
# Auto download: pass a released checkpoint name instead of a local path
torchrun --standalone --nproc_per_node 1 scripts/inference.py configs/opensora/inference/16x256x256.py --ckpt-path OpenSora-v1-HQ-16x256x256.pth --prompt-path ./assets/texts/t2v_samples.txt
# Sample 16x512x512 (20s/sample, 100 time steps, 24 GB memory)
torchrun --standalone --nproc_per_node 1 scripts/inference.py configs/opensora/inference/16x512x512.py --ckpt-path ./path/to/your/ckpt.pth --prompt-path ./assets/texts/t2v_samples.txt
# Auto download: pass a released checkpoint name instead of a local path
torchrun --standalone --nproc_per_node 1 scripts/inference.py configs/opensora/inference/16x512x512.py --ckpt-path OpenSora-v1-HQ-16x512x512.pth --prompt-path ./assets/texts/t2v_samples.txt
# Sample 64x512x512 (40s/sample, 100 time steps)
torchrun --standalone --nproc_per_node 1 scripts/inference.py configs/opensora/inference/64x512x512.py --ckpt-path ./path/to/your/ckpt.pth --prompt-path ./assets/texts/t2v_samples.txt
# Sample 64x512x512 with sequence parallelism (30s/sample, 100 time steps)
# sequence parallelism is enabled automatically when nproc_per_node is larger than 1
torchrun --standalone --nproc_per_node 2 scripts/inference.py configs/opensora/inference/64x512x512.py --ckpt-path ./path/to/your/ckpt.pth --prompt-path ./assets/texts/t2v_samples.txt
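To give an intuition for what sequence parallelism does in the last command, the sketch below partitions a token sequence evenly across ranks. It is a conceptual illustration only, not the project's implementation, and the real code may split along a different axis. The token count further assumes 8× spatial VAE downsampling and 2×2 spatial patches with no temporal patching; these numbers are not given in this README.

```python
from typing import List


def split_sequence(seq_len: int, world_size: int) -> List[range]:
    """Evenly partition token positions [0, seq_len) into contiguous per-rank chunks."""
    if world_size <= 0:
        raise ValueError("world_size must be positive")
    if seq_len % world_size != 0:
        raise ValueError("this sketch requires seq_len to be divisible by world_size")
    chunk = seq_len // world_size
    return [range(rank * chunk, (rank + 1) * chunk) for rank in range(world_size)]


# Hypothetical token count for a 64x512x512 video under the assumptions above.
frames, height, width = 64, 512, 512
vae_downsample, patch = 8, 2  # assumed values, see lead-in
tokens = frames * (height // vae_downsample // patch) * (width // vae_downsample // patch)

# With --nproc_per_node 2, each rank would hold half of the tokens.
chunks = split_sequence(tokens, world_size=2)
tokens_per_rank = [len(c) for c in chunks]
```

Each rank then stores activations for only its own chunk, which is why longer videos benefit most from running on more than one GPU.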
The speed is tested on H800 GPUs. For inference with other models, see here for more instructions. To lower memory usage, set a smaller vae.micro_batch_size in the config (at the cost of slightly lower sampling speed).
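The memory/speed trade-off behind a micro batch size can be sketched generically: instead of processing all items at once, process them in small chunks and concatenate the results. The helper `run_in_micro_batches` and the stand-in `fake_decode` below are hypothetical names for illustration and do not reflect how the project's VAE wrapper is written.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def run_in_micro_batches(
    fn: Callable[[Sequence[T]], List[R]],
    items: Sequence[T],
    micro_batch_size: int,
) -> List[R]:
    """Apply fn to items in chunks of at most micro_batch_size and concatenate the outputs."""
    if micro_batch_size <= 0:
        raise ValueError("micro_batch_size must be positive")
    outputs: List[R] = []
    for start in range(0, len(items), micro_batch_size):
        outputs.extend(fn(items[start:start + micro_batch_size]))
    return outputs


# Stand-in for a VAE decode call; it records how many items each call received.
chunk_sizes: List[int] = []


def fake_decode(latents: Sequence[int]) -> List[int]:
    chunk_sizes.append(len(latents))
    return [x * 2 for x in latents]


latents = list(range(16))
frames = run_in_micro_batches(fake_decode, latents, micro_batch_size=4)
peak_chunk = max(chunk_sizes)
```

A smaller chunk lowers the peak number of items held in memory per call, while the larger number of calls accounts for the slightly lower sampling speed.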
High-quality data is key to high-quality models. The datasets we use and our data collection plan are described here. We provide tools to process video data. Currently, our data processing pipeline includes downloading, video cutting, and captioning.
To launch training, first download T5 weights into pretrained_models/t5_ckpts/t5-v1_1-xxl. Then run the following commands to launch training on a single node.
# 1 GPU, 16x256x256
torchrun --nnodes=1 --nproc_per_node=1 scripts/train.py configs/opensora/train/16x256x256.py --data-path YOUR_CSV_PATH
# 8 GPUs, 64x512x512
torchrun --nnodes=1 --nproc_per_node=8 scripts/train.py configs/opensora/train/64x512x512.py --data-path YOUR_CSV_PATH --ckpt-path YOUR_PRETRAINED_CKPT
To launch training on multiple nodes, prepare a hostfile as described in the ColossalAI documentation, then run the following command.
colossalai run --nproc_per_node 8 --hostfile hostfile scripts/train.py configs/opensora/train/64x512x512.py --data-path YOUR_CSV_PATH --ckpt-path YOUR_PRETRAINED_CKPT
For training other models and advanced usage, see here for more instructions.
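When launching several single-node runs from a script, it can help to assemble the command programmatically. The sketch below uses only the flags shown in the commands above; the helper name `build_train_command` is hypothetical and not part of the repository.

```python
import shlex
from typing import List, Optional


def build_train_command(
    config: str,
    data_path: str,
    nproc_per_node: int = 1,
    ckpt_path: Optional[str] = None,
) -> List[str]:
    """Assemble a single-node torchrun training command as an argument list."""
    cmd = [
        "torchrun",
        "--nnodes=1",
        f"--nproc_per_node={nproc_per_node}",
        "scripts/train.py",
        config,
        "--data-path",
        data_path,
    ]
    if ckpt_path is not None:
        # Optional: start from a pretrained checkpoint.
        cmd += ["--ckpt-path", ckpt_path]
    return cmd


cmd = build_train_command(
    "configs/opensora/train/64x512x512.py",
    "YOUR_CSV_PATH",
    nproc_per_node=8,
    ckpt_path="YOUR_PRETRAINED_CKPT",
)
# shlex.join quotes arguments safely for display or logging.
command_line = shlex.join(cmd)
print(command_line)
```

The argument list can be passed directly to `subprocess.run(cmd, check=True)` from the repository root, which avoids shell quoting issues with paths that contain spaces.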
Thanks go to these wonderful contributors (emoji key following the all-contributors specification):
zhengzangw 💻 📖 🤔 📹 🚧 |
ver217 💻 🤔 📖 🐛 |
FrankLeeeee 💻 🚇 🔧 |
xyupeng 💻 📖 🎨 |
Yanjia0 📖 |
binmakeswell 📖 |
eltociear 📖 |
ganeshkrishnan1 📖 |
fastalgo 📖 |
powerzbt 📖 |
If you wish to contribute to this project, you can refer to the Contribution Guideline.
- ColossalAI: A powerful large model parallel acceleration and optimization system.
- DiT: Scalable Diffusion Models with Transformers.
- OpenDiT: An acceleration framework for DiT training. We adopt valuable acceleration strategies for the training process from OpenDiT.
- PixArt: An open-source DiT-based text-to-image model.
- Latte: An attempt to efficiently train DiT for video.
- StabilityAI VAE: A powerful image VAE model.
- CLIP: A powerful text-image embedding model.
- T5: A powerful text encoder.
- LLaVA: A powerful image captioning model based on Yi-34B.
We are grateful for their exceptional work and generous contribution to open source.
@software{opensora,
author = {Zangwei Zheng and Xiangyu Peng and Yang You},
title = {Open-Sora: Democratizing Efficient Video Production for All},
month = {March},
year = {2024},
url = {https://github.com/hpcaitech/Open-Sora}
}
Zangwei Zheng and Xiangyu Peng contributed equally to this work during their internship at HPC-AI Tech.