
Jenga

This is the official implementation of the paper Training-Free Efficient Video Generation via Dynamic Token Carving.

Overview

Jenga generates videos 4.68x to 10.35x faster than the corresponding base models on a single GPU.

Please visit the project page for more video results.

Open-source Plan

  • Model Adaptation
    • HunyuanVideo Inference
    • Multi-GPU parallel inference (faster inference with more GPUs)
    • HunyuanVideo-I2V Inference
    • Wan2.1
  • Engineering Optimization
    • Quantization
    • ComfyUI
    • RoPE & Norm Kernel
    • FA3 Adaptation

Guidance

Inference on HunyuanVideo

Environment

Follow the installation instructions from HunyuanVideo:

# 1. Create conda environment
conda create -n Jenga python==3.10.9

# 2. Activate the environment
conda activate Jenga

# 3. Install PyTorch and other dependencies using conda
# For CUDA 12.4
conda install pytorch==2.4.0 torchvision==0.19.0 torchaudio==2.4.0 pytorch-cuda=12.4 -c pytorch -c nvidia

# 4. Install pip dependencies
python -m pip install -r hy_requirements.txt

# 5. Install flash attention v2 for acceleration (requires CUDA 11.8 or above)
python -m pip install ninja
python -m pip install git+https://github.com/Dao-AILab/flash-attention.git@v2.6.3

# 6. Install xDiT for parallel inference (we test on H800, cuda124)
python -m pip install xfuser==0.4.3.post3
python -m pip install yunchang==0.6.3.post1

Download model

Please follow the instructions in model_down_hy.md.
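
For reference, a minimal sketch of pulling the weights with the huggingface_hub Python API is shown below; the target directory is an assumption, and model_down_hy.md remains the authoritative source for the expected layout.

# Minimal sketch (assumption): download the HunyuanVideo weights with huggingface_hub.
# Follow model_down_hy.md for the exact directory layout expected by the Jenga scripts.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="tencent/HunyuanVideo",   # public HunyuanVideo checkpoint on Hugging Face
    local_dir="ckpts",                # assumed target directory
)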

Single GPU Inference

bash scripts/hyvideo_jenga_base.sh # Jenga Base (Opt. 310s)
# bash scripts/hyvideo_jenga_turbo.sh # Jenga Turbo
# bash scripts/hyvideo_jenga_flash.sh # Jenga Flash
# bash scripts/hyvideo_jenga_3stage.sh # Jenga 3Stage 

Inference time for different settings (DiT time, single H800, after warmup):

HunyuanVideo | Jenga-Base   | Jenga-Turbo  | Jenga-Flash  | Jenga-3Stage
1625s        | 310s (5.24x) | 225s (7.22x) | 184s (8.82x) | 157s (10.35x)

If you want to type your prompt directly, just change --prompt. The following command corresponds to Jenga-Turbo.

If you encounter an OOM issue, try adding --use-cpu-offload.

CUDA_VISIBLE_DEVICES=0 python3 -u ./jenga_hyvideo.py \
    --video-size 720 1280 \
    --video-length 125 \
    --infer-steps 50 \
    --prompt "A cat walks on the grass, realistic style." \
    --seed 42 \
    --embedded-cfg-scale 6.0 \
    --flow-shift 7.0 \
    --flow-reverse \
    --sa-drop-rates 0.7 0.8 \
    --p-remain-rates 0.3 \
    --post-fix "Jenga_Turbo" \
    --save-path ./results/hyvideo \
    --res-rate-list 0.75 1.0 \
    --step-rate-list 0.5 1.0 \
    --scheduler-shift-list 7 9

Multi GPU Inference

We provide a set of scripts runnable on 8 GPUs (a further 5-6x speedup compared with single-GPU inference):

bash scripts/hyvide_multigpu_jenga_base.sh 
# bash scripts/hyvide_multigpu_jenga_turbo.sh 
# bash scripts/hyvide_multigpu_jenga_flash.sh 
# bash scripts/hyvide_multigpu_jenga_3stage.sh 

To customize (using Jenga-Turbo as an example):

export NPROC_PER_NODE=8
export ULYSSES_DEGREE=8 # number of GPUs

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 torchrun --nproc_per_node=$NPROC_PER_NODE ./jenga_hyvideo_multigpu.py \
    --video-size 720 1280 \
    --video-length 125 \
    --infer-steps 50 \
    --prompt "The camera rotates around a large stack of vintage televisions all showing different programs -- 1950s sci-fi movies, horror movies, news, static, a 1970s sitcom, etc, set inside a large New York museum gallery." \
    --seed 42 \
    --embedded-cfg-scale 6.0 \
    --flow-shift 7.0 \
    --flow-reverse \
    --sa-drop-rates 0.75 0.85 \
    --p-remain-rates 0.3 \
    --post-fix "Jenga_Turbo" \
    --save-path ./results/hyvideo_multigpu \
    --res-rate-list 0.75 1.0 \
    --step-rate-list 0.5 1.0 \
    --ulysses-degree $ULYSSES_DEGREE \
    --scheduler-shift-list 7 9

Inference time for different settings (DiT time, 8xH800, after warmup):

HunyuanVideo | Jenga-Base  | Jenga-Turbo | Jenga-Flash | Jenga-3Stage
225s         | 55s (4.09x) | 40s (5.62x) | 38s (5.92x) | 32s (7.03x)

Run Multiple Samples with Multi-GPU

Because the VAE takes a roughly constant amount of time regardless of the number of GPUs, we recommend allocating each prompt to its own GPU for batch sampling. Please check the sample script (which uses the Jenga-Turbo setting):

bash ./scripts/hyvideo_batched_sample.sh
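
If you prefer to drive the batching from Python, here is a hypothetical launcher (not the repository's script) that pins each prompt to its own GPU via CUDA_VISIBLE_DEVICES; it assumes jenga_hyvideo.py accepts the same flags as the single-GPU example above.

# Hypothetical launcher: one prompt per GPU, run in parallel.
# Assumes jenga_hyvideo.py accepts the flags shown in the single-GPU example;
# see scripts/hyvideo_batched_sample.sh for the repository's actual script.
import os
import subprocess

prompts = [
    "A cat walks on the grass, realistic style.",
    "A dog runs along the beach at sunset.",
]

procs = []
for gpu_id, prompt in enumerate(prompts):
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(gpu_id))
    cmd = [
        "python3", "-u", "./jenga_hyvideo.py",
        "--video-size", "720", "1280",
        "--video-length", "125",
        "--infer-steps", "50",
        "--prompt", prompt,
        "--save-path", f"./results/hyvideo/gpu{gpu_id}",   # per-GPU output dir (assumption)
    ]
    procs.append(subprocess.Popen(cmd, env=env))

for p in procs:
    p.wait()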

Inference on AccVideo (Distilled Models)

The general pipeline is the same; just download the weights from Hugging Face to ckpts/AccVideo.

Then run the script:

bash ./scripts/accvideo_jenga.sh

Method Overview

The core idea of Jenga is to reduce token interactions in Diffusion Transformers (DiTs). An overview follows.

The left part illustrates Attention Carving: the 3D video latent is partitioned into local blocks before being passed to the Transformer layers, block-wise attention is computed to obtain head-aware sparse block-selection masks, and dense parallel attention is then performed within each selected block. The right part illustrates the Progressive Resolution strategy: the numbers of tokens and timesteps are compressed to keep generation efficient.
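
To make the block-selection idea concrete, here is a minimal PyTorch sketch (not the repository's customized kernel): block-pooled queries and keys produce block-level attention scores, and the top-scoring key blocks are kept independently per head. Shapes, mean pooling, and the keep ratio are illustrative assumptions.

# Illustrative sketch of head-aware sparse block selection (not the actual kernel).
import torch

def select_blocks(q, k, block_size=64, keep_ratio=0.3):
    # q, k: [heads, tokens, dim], with tokens already re-ordered into contiguous blocks.
    H, N, D = q.shape
    B = N // block_size                                      # number of blocks
    # Pool queries/keys within each block (mean pooling as a stand-in).
    qb = q[:, :B * block_size].reshape(H, B, block_size, D).mean(dim=2)
    kb = k[:, :B * block_size].reshape(H, B, block_size, D).mean(dim=2)
    # Block-wise attention scores: [heads, B, B].
    scores = torch.softmax(qb @ kb.transpose(-1, -2) / D ** 0.5, dim=-1)
    # Keep the top-k key blocks per query block, independently for each head.
    topk = max(1, int(keep_ratio * B))
    keep = torch.zeros(H, B, B)
    keep.scatter_(-1, scores.topk(topk, dim=-1).indices, 1.0)
    # Dense attention is then run only inside the selected block pairs.
    return keep.bool()

mask = select_blocks(torch.randn(8, 1024, 64), torch.randn(8, 1024, 64))
print(mask.shape)   # torch.Size([8, 16, 16])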


Attention Carving (AttenCarve). Here we illustrate a toy example of a 4x4x4 latent, where m=8 latent items form a block. Left: the latent is 3D re-ordered and partitioned into blocks via space-filling curves (SFC). Right: after the block-wise attention, we construct the Importance Mask; combined with the pre-computed Condition Mask and Adjacency Mask, the resulting block-wise dense attention mask is passed to a customized kernel for device-efficient attention.
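
The mask combination itself can be pictured as a logical OR over block-level masks. The toy sketch below (matching the 4x4x4 example, i.e. 64 latent items in blocks of m=8) uses random importance scores and a placeholder neighbor pattern in place of the real SFC ordering, purely to show how the three masks combine.

# Toy sketch of the AttenCarve block-mask combination (illustrative only).
import torch

num_blocks = 8           # 4x4x4 = 64 latent items, m = 8 items per block
num_text_blocks = 1      # blocks holding text-condition tokens (assumption)

# Importance Mask: top-k key blocks per query block from block-wise attention.
scores = torch.rand(num_blocks, num_blocks)
importance = torch.zeros(num_blocks, num_blocks)
importance.scatter_(-1, scores.topk(3, dim=-1).indices, 1.0)
importance = importance.bool()

# Condition Mask: every block always attends to the text-condition blocks.
condition = torch.zeros(num_blocks, num_blocks, dtype=torch.bool)
condition[:, :num_text_blocks] = True

# Adjacency Mask: each block attends to itself and its neighbors along the SFC order.
adjacency = torch.eye(num_blocks, dtype=torch.bool)
adjacency |= torch.diag(torch.ones(num_blocks - 1), 1).bool()
adjacency |= torch.diag(torch.ones(num_blocks - 1), -1).bool()

# Final block-wise dense attention mask handed to the attention kernel.
block_mask = importance | condition | adjacency
print(block_mask.int())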


Progressive Resolution (ProRes). Left: a brief illustration of the stage switch and timestep skip. Before rescaling in stage s, we revert the latent to a clean state $\hat{x}^{s}_0$, then re-noise the upsampled clean latent. Right & Bottom: we add a bias to the video-text attention score to enable a scalable Field of View (FOV) in low-resolution content generation.
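
A minimal sketch of the stage switch, assuming a rectified-flow noising rule x_t = (1 - t) * x_0 + t * noise with velocity v = noise - x_0 (the repository's scheduler settings may differ): the current latent is reverted to a clean estimate, trilinearly upsampled to the next stage's resolution, and re-noised at the timestep where the next stage resumes.

# Illustrative ProRes stage switch under a rectified-flow assumption.
import torch
import torch.nn.functional as F

def stage_switch(x_t, velocity, t, target_size, t_next=0.5):
    # x_t:         current latent [B, C, T, H, W] at timestep t
    # velocity:    model prediction, assumed convention v = noise - x_0
    # target_size: (T, H, W) of the next, higher-resolution stage
    # t_next:      timestep at which the next stage resumes (assumption)
    # Revert to the clean-state estimate \hat{x}^{s}_0 of the current stage.
    x0_hat = x_t - t * velocity
    # Upsample the clean latent to the next stage's resolution.
    x0_up = F.interpolate(x0_hat, size=target_size, mode="trilinear",
                          align_corners=False)
    # Re-noise the upsampled clean latent.
    noise = torch.randn_like(x0_up)
    return (1.0 - t_next) * x0_up + t_next * noise

x = torch.randn(1, 16, 8, 45, 80)                                   # toy latent at 0.75x resolution
v = torch.randn_like(x)
print(stage_switch(x, v, t=0.5, target_size=(8, 60, 106)).shape)    # [1, 16, 8, 60, 106]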


Citation

If you find Jenga useful for your research and applications, please cite using this BibTeX:

@article{zhang2025trainingfreeefficientvideogeneration,
    title={Training-Free Efficient Video Generation via Dynamic Token Carving},
    author={Yuechen Zhang and Jinbo Xing and Bin Xia and Shaoteng Liu and Bohao Peng and Xin Tao and Pengfei Wan and Eric Lo and Jiaya Jia},
    journal={arXiv preprint arXiv:2505.16864},
    year={2025}
}

Acknowledgements

We would like to thank the contributors to the HunyuanVideo, HunyuanVideo-I2V, Wan2.1, AccVideo, MInference, Gilbert, and HuggingFace repositories for their open research and exploration. We also thank the Tencent Hunyuan Multimodal team for their help with the text encoder.