OpenDiT: An Easy, Fast and Memory-Efficient System for DiT Training and Inference

Primary language: Python. License: Apache License 2.0 (Apache-2.0).

[Homepage] | [Discord] | [WeChat] | [Twitter] | [Zhihu] | [Media]

Latest News 🔥

  • [2024/06] Proposed Pyramid Attention Broadcast (PAB) [blog][doc], the first approach to achieve real-time DiT-based video generation, delivering lossless quality without requiring any training.
  • [2024/06] Added support for OpenSora, Open-Sora-Plan and Latte.
  • [2024/03] Proposed Dynamic Sequence Parallel (DSP) [paper][doc], which achieves a 3x speedup for training and a 2x speedup for inference in OpenSora compared with state-of-the-art sequence parallelism.
  • [2024/03] Added support for OpenSora: Democratizing Efficient Video Production for All.
  • [2024/02] Officially released OpenDiT: An Easy, Fast and Memory-Efficient System for DiT Training and Inference.

About

OpenDiT is an open-source project that provides a high-performance implementation of Diffusion Transformer (DiT) powered by Colossal-AI, specifically designed to enhance the efficiency of training and inference for DiT applications, including text-to-video generation and text-to-image generation.

OpenDiT has been adopted by: OpenSora, MiniSora, SpeeDiT.

OpenDiT improves performance through the following techniques:

  1. Up to 80% speedup and 50% memory reduction on GPU
    • Kernel optimization including FlashAttention, Fused AdaLN, and Fused layernorm kernel.
    • Hybrid parallelism methods including ZeRO, Gemini, and DDP. Sharding the EMA model further reduces the memory cost.
  2. FastSeq: A novel sequence parallelism method
    • Specially designed for DiT-like workloads where the activation size is large but the parameter size is small.
    • Up to 48% communication savings for intra-node sequence parallelism.
    • Breaks the memory limitation of a single GPU and reduces the overall training and inference time.
  3. Ease of use
    • Large performance gains with only a few lines of code changed.
    • Users do not need to know the implementation of distributed training.
  4. Complete pipeline of text-to-image and text-to-video generation
    • Researchers and engineers can easily use and adapt our pipeline to real-world applications without modifying the parallel part.
    • We verified the accuracy of OpenDiT with text-to-image training on ImageNet and released the checkpoint.
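
The EMA sharding mentioned in item 1 can be illustrated with a small, framework-free sketch. It assumes the EMA weights form one flat list that is split into contiguous slices, one per rank, and that each rank updates only its own slice. The helper names (`shard_bounds`, `ema_update_shard`), the toy sizes, and the decay value are hypothetical and are not OpenDiT's API.

```python
from typing import List, Tuple


def shard_bounds(numel: int, world_size: int, rank: int) -> Tuple[int, int]:
    """Return the [start, end) slice of a flat parameter list owned by `rank`."""
    per_rank = (numel + world_size - 1) // world_size  # ceiling division
    start = min(rank * per_rank, numel)
    end = min(start + per_rank, numel)
    return start, end


def ema_update_shard(ema_shard: List[float], params: List[float],
                     start: int, decay: float = 0.999) -> None:
    """Update only the locally owned EMA slice, in place."""
    for i in range(len(ema_shard)):
        ema_shard[i] = decay * ema_shard[i] + (1.0 - decay) * params[start + i]


# Toy run: 10 parameters spread over 4 simulated ranks (hypothetical sizes).
params = [float(i) for i in range(10)]
world_size = 4
shards = []
for rank in range(world_size):
    start, end = shard_bounds(len(params), world_size, rank)
    ema = [0.0] * (end - start)  # each rank stores only its own slice
    ema_update_shard(ema, params, start, decay=0.5)
    shards.append(ema)

# The largest slice bounds the per-rank EMA memory.
largest = max(len(s) for s in shards)
```

In this toy run no rank holds more than 3 of the 10 EMA values, which is the memory effect the bullet describes, while the concatenated slices still cover every parameter exactly once.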

[Figure: end2end]

Authors: Xuanlei Zhao, Zhongkai Zhao, Ziming Liu, Haotian Zhou, Qianli Ma, Yang You

OpenDiT will continue to integrate more open-source DiT models. Stay tuned for upcoming enhancements and additional features!

Installation

Prerequisites:

  • Python >= 3.10
  • PyTorch >= 1.13 (we recommend using a version above 2.0)
  • CUDA >= 11.6
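
The minimums above can be checked with a short script. This is a sketch, not part of OpenDiT: `parse_version`, `meets`, and `check_environment` are hypothetical helpers, and the CUDA value inspected is the version PyTorch was built against (`torch.version.cuda`), which may differ from the toolkit installed on the system.

```python
import sys
from typing import Dict, Optional, Tuple


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn '2.1.0+cu118' into (2, 1, 0); non-numeric suffixes are dropped."""
    core = text.split("+")[0]
    parts = []
    for piece in core.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def meets(found: str, minimum: str) -> bool:
    """Compare versions numerically, padding the shorter one with zeros."""
    a, b = parse_version(found), parse_version(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def check_environment() -> Dict[str, Optional[bool]]:
    """Return True/False per prerequisite, or None if it cannot be detected."""
    report: Dict[str, Optional[bool]] = {
        "python": sys.version_info[:2] >= (3, 10),
        "pytorch": None,
        "cuda": None,
    }
    try:
        import torch  # only inspected if it is already installed
    except ImportError:
        return report
    report["pytorch"] = meets(torch.__version__, "1.13")
    if torch.version.cuda is not None:  # None on CPU-only builds
        report["cuda"] = meets(torch.version.cuda, "11.6")
    return report


if __name__ == "__main__":
    for name, ok in check_environment().items():
        print(f"{name}: {ok}")
```

Versions are compared as integer tuples because a plain string comparison would rank "3.9" above "3.10".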

We strongly recommend using Anaconda to create a new environment (Python >= 3.10) to run our examples:

conda create -n opendit python=3.10 -y
conda activate opendit

Install ColossalAI:

pip install colossalai==0.3.7

Install OpenDiT:

git clone https://github.com/NUS-HPC-AI-Lab/OpenDiT
cd OpenDiT
pip install -e .

Usage

OpenDiT fully supports the following models, including training and inference, which align with the original methods. Through our novel techniques, we enable these models to run faster and consume less memory. Here's how you can use them:

Model Train Inference Optimize Usage
DiT[source] Doc
Open-Sora[source] 🟡 Doc
Latte[source] Doc
Open-Sora-Plan[source] Doc

Technique Overview

Pyramid Attention Broadcast (PAB) [blog][doc]

Real-Time Video Generation with Pyramid Attention Broadcast

Authors: Xuanlei Zhao¹*, Xiaolong Jin²*, Kai Wang¹*, and Yang You¹ (* indicates equal contribution)

¹National University of Singapore, ²Purdue University

[Figure: method]

PAB is the first approach to achieve real-time DiT-based video generation, delivering lossless quality without requiring any training.

By mitigating redundant attention computation, PAB achieves up to 21.6 FPS with 10.6x acceleration, without sacrificing quality across popular DiT-based video generation models including Open-Sora, Open-Sora-Plan, and Latte.

Notably, as a training-free approach, PAB can empower future DiT-based video generation models with real-time capabilities.
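
The idea of avoiding redundant attention computation can be sketched in a few lines. This is a simplified illustration, not PAB's implementation: it assumes that "broadcast" means reusing an attention output computed at one diffusion step for a fixed number of following steps, with a separate reuse interval per attention type. The class name `BroadcastCache`, the intervals, and the step count are all hypothetical.

```python
from typing import Callable, Optional


class BroadcastCache:
    """Recompute an attention output every `interval` steps, reuse it otherwise."""

    def __init__(self, interval: int) -> None:
        self.interval = interval
        self.cached: object = None
        self.last_step: Optional[int] = None
        self.recomputed = 0  # how many times the real computation ran

    def __call__(self, step: int, compute: Callable[[], object]) -> object:
        stale = self.last_step is None or step - self.last_step >= self.interval
        if stale:
            self.cached = compute()
            self.last_step = step
            self.recomputed += 1
        return self.cached


# Hypothetical reuse intervals per attention type (illustrative values only).
intervals = {"spatial": 2, "temporal": 4, "cross": 6}
caches = {name: BroadcastCache(n) for name, n in intervals.items()}

num_steps = 30
for step in range(num_steps):
    for name, cache in caches.items():
        # The lambda stands in for a real attention forward pass.
        cache(step, lambda: f"{name}-attention@{step}")

recomputed = {name: cache.recomputed for name, cache in caches.items()}
```

With these toy settings, the three attention types run 15, 8, and 5 real computations over 30 steps instead of 30 each. No retraining is involved because only the inference loop changes.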

See the details and usage here.


Dynamic Sequence Parallelism (DSP) [paper][doc]

[Figure: dsp_overview]

DSP is a novel, elegant, and highly efficient sequence parallelism method for OpenSora, Latte, and other multi-dimensional transformer architectures.
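
To make "multi-dimensional" concrete, here is a toy simulation of switching the sharded dimension of a video token grid. It is a sketch under assumptions, not DSP's implementation: it assumes tokens are indexed by frame and spatial position, that attention along one axis needs that whole axis on a single device, and that the switch is done with an all-to-all exchange, simulated here with plain lists. The function names are hypothetical, and both axes are assumed to divide evenly by the number of ranks.

```python
from typing import List

Grid = List[List[str]]  # grid[t][s]: token at frame t, spatial position s


def shard_by_time(grid: Grid, world_size: int) -> List[Grid]:
    """Each rank holds a contiguous block of frames with all spatial positions."""
    per_rank = len(grid) // world_size
    return [grid[r * per_rank:(r + 1) * per_rank] for r in range(world_size)]


def switch_to_space(shards: List[Grid]) -> List[Grid]:
    """Simulated all-to-all: afterwards each rank holds every frame, but only
    a contiguous block of spatial positions."""
    world_size = len(shards)
    num_space = len(shards[0][0])
    per_rank = num_space // world_size
    out = []
    for dst in range(world_size):
        lo, hi = dst * per_rank, (dst + 1) * per_rank
        local = []
        for src in range(world_size):  # frames arrive in rank order
            for row in shards[src]:
                local.append(row[lo:hi])
        out.append(local)
    return out


T, S, world_size = 4, 6, 2
grid = [[f"t{t}s{s}" for s in range(S)] for t in range(T)]

# Sharded along time: every rank sees full spatial rows (spatial attention).
time_shards = shard_by_time(grid, world_size)
# Sharded along space: every rank sees full temporal columns (temporal attention).
space_shards = switch_to_space(time_shards)
```

Each rank holds 12 tokens before and after the switch, so per-device memory stays constant while the axis that is complete on each device changes.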

It achieves a 3x speedup for training and a 2x speedup for inference in OpenSora compared with the state-of-the-art sequence parallelism method (DeepSpeed Ulysses). For a 10-second (80-frame) 512x512 video, the inference latency of OpenSora is:

Method | 1xH800 | 8xH800 (DS Ulysses) | 8xH800 (DSP)
Latency (s) | 106 | 45 | 22
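
The speedups implied by these latencies can be worked out directly. The snippet below only does arithmetic on the three numbers in the table; the dictionary keys and the `speedup` helper are illustrative.

```python
# Inference latency in seconds, copied from the table above.
latency_s = {
    "1xH800": 106,
    "8xH800 (DS Ulysses)": 45,
    "8xH800 (DSP)": 22,
}


def speedup(baseline: str, method: str) -> float:
    """How many times faster `method` is than `baseline`."""
    return latency_s[baseline] / latency_s[method]


dsp_vs_ulysses = round(speedup("8xH800 (DS Ulysses)", "8xH800 (DSP)"), 2)
dsp_vs_single = round(speedup("1xH800", "8xH800 (DSP)"), 2)
ulysses_vs_single = round(speedup("1xH800", "8xH800 (DS Ulysses)"), 2)

print(dsp_vs_ulysses, dsp_vs_single, ulysses_vs_single)
```

DSP comes out about 2.05x faster than DeepSpeed Ulysses on 8 GPUs, consistent with the 2x inference figure quoted above, and about 4.82x faster than a single H800.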

See the details and usage here.


DiT Reproduction Result

To verify accuracy, we trained DiT with OpenDiT using the original method, from scratch on ImageNet for 80k steps on 8xA100. Here are some results generated by our trained DiT:

[Figure: Results]

Our loss also aligns with the results listed in the paper:

[Figure: Loss]

To reproduce our results, follow our instructions.

Acknowledgement

We extend our gratitude to Zangwei Zheng for providing valuable insights into algorithms and aiding in the development of the video pipeline. Additionally, we acknowledge Shenggan Cheng for his guidance on code optimization and parallelism. Our appreciation also goes to Fuzhao Xue, Shizun Wang, Yuchao Gu, Shenggui Li, and Haofan Wang for their invaluable advice and contributions.

This codebase borrows from:

  • Open-Sora: Democratizing Efficient Video Production for All.
  • DiT: Scalable Diffusion Models with Transformers.
  • PixArt: An open-source DiT-based text-to-image model.
  • Latte: An attempt to efficiently train DiT for video.

Contributing

If you encounter problems using OpenDiT or have a feature request, feel free to create an issue! We also welcome pull requests from the community.

Citation

@misc{zhao2024opendit,
  author = {Xuanlei Zhao and Zhongkai Zhao and Ziming Liu and Haotian Zhou and Qianli Ma and Yang You},
  title = {OpenDiT: An Easy, Fast and Memory-Efficient System for DiT Training and Inference},
  year = {2024},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/NUS-HPC-AI-Lab/OpenDiT}},
}

@misc{zhao2024dsp,
      title={DSP: Dynamic Sequence Parallelism for Multi-Dimensional Transformers},
      author={Xuanlei Zhao and Shenggan Cheng and Zangwei Zheng and Zheming Yang and Ziming Liu and Yang You},
      year={2024},
      eprint={2403.10266},
      archivePrefix={arXiv},
      primaryClass={cs.DC}
}
