NUS-HPC-AI-Lab/VideoSys

Questions about longseq

yhy-2000 opened this issue · 3 comments

Hi, thanks for your great work.

I've run tests on two A40 GPUs with your acceleration techniques. Executing the following command:

torchrun --standalone --nproc_per_node=2 train.py \
    --model DiT-XL/2 \
    --batch_size 3 \
    --num_classes 10

I observed that each epoch takes approximately 1.5 hours to complete, as indicated by: Epoch 0: 3%|▎ | 284/8333 [03:15<1:30:18, 1.49it/s, loss=0.173, step=283, global_step=283].

However, after enabling all the acceleration techniques described in the README, using the command:

torchrun --standalone --nproc_per_node=2 train.py \
    --model DiT-XL/2 \
    --batch_size 3 \
    --num_classes 10 \
    --sequence_parallel_type longseq \
    --sequence_parallel_size 2 \
    --enable_modulate_kernel \
    --enable_flashattn \
    --enable_layernorm_kernel

the duration of each epoch doubled to about 3 hours (note the step count also doubled from 8333 to 16666): Epoch 0: 0%| | 34/16666 [06:33<3:17:52, 1.40it/s, loss=0.943, step=33, global_step=33].

Could you please explain the reason behind this increased processing time?

Do not use sequence parallelism unless it is necessary, and increase the batch size as much as you can. You can follow the instructions in the README for more details.

Thank you for the guidance. However, I'm curious about the '80% speedup' mentioned in the readme. Could you clarify how to achieve this performance improvement?

Enable all kernels (except the modulate kernel, because it currently has an accuracy problem) and use as large a batch size as you can.
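Putting the two replies together, a launch command along these lines should reflect the recommendation. This is only a sketch: the batch size of 32 is a placeholder (use the largest value that fits in GPU memory), and the flags are the ones already shown in this thread.

```shell
# Sketch of the recommended setup, assuming the same train.py as above:
# - no sequence parallelism (drop --sequence_parallel_type / --sequence_parallel_size)
# - FlashAttention and LayerNorm kernels enabled
# - modulate kernel skipped due to the accuracy problem noted above
# - batch size raised as far as memory allows (32 here is just a placeholder)
torchrun --standalone --nproc_per_node=2 train.py \
    --model DiT-XL/2 \
    --batch_size 32 \
    --num_classes 10 \
    --enable_flashattn \
    --enable_layernorm_kernel
```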