NUS-HPC-AI-Lab/VideoSys

Questions about longseq

yhy-2000 opened this issue · 3 comments

Hi, thanks for your great work.

I've run tests on two A40 GPUs with your acceleration techniques. Executing the following command:

torchrun --standalone --nproc_per_node=2 train.py \
    --model DiT-XL/2 \
    --batch_size 3 \
    --num_classes 10

I observed that each epoch takes approximately 1.5 hours to complete, as indicated by: Epoch 0: 3%|▎ | 284/8333 [03:15<1:30:18, 1.49it/s, loss=0.173, step=283, global_step=283].

However, after enabling all the acceleration techniques described in the README, using the command:

torchrun --standalone --nproc_per_node=2 train.py \
    --model DiT-XL/2 \
    --batch_size 3 \
    --num_classes 10 \
    --sequence_parallel_type longseq \
    --sequence_parallel_size 2 \
    --enable_modulate_kernel \
    --enable_flashattn \
    --enable_layernorm_kernel

the duration of each epoch doubled to about 3 hours (note the step count also doubled from 8333 to 16666): Epoch 0: 0%| | 34/16666 [06:33<3:17:52, 1.40it/s, loss=0.943, step=33, global_step=33].

Could you please explain the reason behind this increased processing time?

Do not use sequence parallelism unless it is necessary, and increase the batch size as much as you can. You can follow the instructions in the README for more details.

Thank you for the guidance. However, I'm curious about the '80% speedup' mentioned in the readme. Could you clarify how to achieve this performance improvement?

Enable all kernels (except the modulate kernel, because it currently has an accuracy problem) and use as large a batch size as you can.
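Putting the two replies together, a launch command along these lines should reflect the recommendation. This is only a sketch: the batch size of 32 is a placeholder (use the largest value that fits in GPU memory), and the flags are the ones already shown in this thread.

```shell
# Sketch of the recommended setup, assuming the same train.py as above:
# - no sequence parallelism (drop --sequence_parallel_type / --sequence_parallel_size)
# - FlashAttention and LayerNorm kernels enabled
# - modulate kernel skipped due to the accuracy problem noted above
# - batch size raised as far as memory allows (32 here is just a placeholder)
torchrun --standalone --nproc_per_node=2 train.py \
    --model DiT-XL/2 \
    --batch_size 32 \
    --num_classes 10 \
    --enable_flashattn \
    --enable_layernorm_kernel
```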