Coobiw/MPP-LLaVA

DeepSpeed pipeline parallelism requires a fixed seq-length (pad accordingly during collation) and a fixed batch size (set the dataloader's `drop_last` to True)

Closed this issue · 11 comments

There is very little material online about DeepSpeed pipeline parallelism... I've run into a problem and would appreciate your help analyzing it when you have time.
Following your code, I wrote training code for another VLM, and hit a strange issue during training:
Setup: num_stages=4, ngpus_per_node=8, so pp=4 and dp=2. rank0 and rank1 then each get a batch, B1 and B2, with sequence lengths N1 and N2 respectively.
Autograd then fails with a shape-mismatch error: the grad has shape N1 while the output has shape N2, as if autograd used rank1's batch B1 to update rank2's batch B2.

I have no idea where to start debugging this, and I couldn't find any relevant material.

P.S.: I added prints inside the LLM's BlockPipeLayer and found that B1's data forwards through all layers, while B2's data only forwards through the first twenty-odd layers and never finishes propagating. Is some synchronization broken somewhere?

On closer inspection, it's not crossing ranks: within rank0, when the tensor is transferred from GPU1's pipe stage (layers 0~20) to GPU2's pipe stage (layers 21-34), its sequence length changes in transit (2369 -> 2262). Why would that happen?

Hi, two questions first: 1. Is your sequence processing done the same way as in this repo? My preprocessing never produces unequal-length sequences. 2. Could you share the error log and a screenshot of your output?

Hi, thanks for your reply.

  1. I use InternVL's sequence-processing method; the sequence lengths a batch ends up with are definitely equal — this is fully handled in preprocess and the collator.
  2. The final error (one of them; there is another identical one):
    RuntimeError: Mismatch in shape: grad_output[0] has a shape of torch.Size([4, 2361, 6144]) and output[0] has a shape of torch.Size([4, 2481, 6144])

A fairly complete error log (I printed each layer's input_embeds.shape, plus a tag passed in from my collator — a tensor concatenating input_ids.sum() with a random.randint(100, 2000)):
dynamic ViT batch size: 29, images per sample: 7.25, dynamic token length: 2262
[2024-06-21 16:40:50,826] [INFO] [checkpointing.py:539:forward] Activation Checkpointing Information
[2024-06-21 16:40:50,826] [INFO] [checkpointing.py:540:forward] ----Partition Activations False, CPU CHECKPOINTING False
[2024-06-21 16:40:50,826] [INFO] [checkpointing.py:541:forward] ----contiguous Memory Checkpointing False with None total layers
[2024-06-21 16:40:50,826] [INFO] [checkpointing.py:543:forward] ----Synchronization False
[2024-06-21 16:40:50,826] [INFO] [checkpointing.py:544:forward] ----Profiling time in checkpointing False
torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.0 cuda:0 tensor([712564646, 1668], device='cuda:0')
torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.1 cuda:0 tensor([712564646, 1668], device='cuda:0')
torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.2 cuda:0 tensor([712564646, 1668], device='cuda:0')
dynamic ViT batch size: 31, images per sample: 7.75, dynamic token length: 2361
torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.3 cuda:0 tensor([712564646, 1668], device='cuda:0')
torch.Size([4, 2361, 6144]) InternLMBlockPipeLayer.0 cuda:1 torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.4 cuda:0 tensor([757955384, 1668], device='cuda:1')
torch.Size([4, 2361, 6144]) InternLMBlockPipeLayer.1 cuda:1 tensor([712564646, 1668], device='cuda:0')
torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.5 cuda:0 tensor([757955384, 1668], device='cuda:1')
torch.Size([4, 2361, 6144]) InternLMBlockPipeLayer.2 cuda:1 tensor([712564646, 1668], device='cuda:0')
torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.6 cuda:0 tensor([757955384, 1668], device='cuda:1')
torch.Size([4, 2361, 6144]) InternLMBlockPipeLayer.3 cuda:1 tensor([712564646, 1668], device='cuda:0')
torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.7 cuda:0 tensor([757955384, 1668], device='cuda:1')
torch.Size([4, 2361, 6144]) InternLMBlockPipeLayer.4 cuda:1 tensor([712564646, 1668], device='cuda:0')
torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.8 cuda:0 tensor([757955384, 1668], device='cuda:1')
torch.Size([4, 2361, 6144]) InternLMBlockPipeLayer.5 cuda:1 tensor([712564646, 1668], device='cuda:0')
torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.9 cuda:0 tensor([757955384, 1668], device='cuda:1')
torch.Size([4, 2361, 6144]) InternLMBlockPipeLayer.6 cuda:1 tensor([712564646, 1668], device='cuda:0')
torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.10 cuda:0 tensor([757955384, 1668], device='cuda:1')
torch.Size([4, 2361, 6144]) InternLMBlockPipeLayer.7 cuda:1 tensor([712564646, 1668], device='cuda:0')
torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.11 cuda:0 tensor([757955384, 1668], device='cuda:1')
torch.Size([4, 2361, 6144]) InternLMBlockPipeLayer.8 cuda:1 tensor([712564646, 1668], device='cuda:0')
torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.12 cuda:0 tensor([757955384, 1668], device='cuda:1')
torch.Size([4, 2361, 6144]) InternLMBlockPipeLayer.9 cuda:1 tensor([712564646, 1668], device='cuda:0')
torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.13 cuda:0 tensor([757955384, 1668], device='cuda:1')
torch.Size([4, 2361, 6144]) InternLMBlockPipeLayer.10 cuda:1 tensor([712564646, 1668], device='cuda:0')
torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.14 cuda:0 tensor([757955384, 1668], device='cuda:1')
torch.Size([4, 2361, 6144]) InternLMBlockPipeLayer.11 cuda:1 tensor([712564646, 1668], device='cuda:0')
torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.15 cuda:0 tensor([757955384, 1668], device='cuda:1')
torch.Size([4, 2361, 6144]) InternLMBlockPipeLayer.12 cuda:1 tensor([712564646, 1668], device='cuda:0')
torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.16 cuda:0 tensor([757955384, 1668], device='cuda:1')
torch.Size([4, 2361, 6144]) InternLMBlockPipeLayer.13 cuda:1 tensor([712564646, 1668], device='cuda:0')
torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.17 cuda:0 tensor([757955384, 1668], device='cuda:1')
torch.Size([4, 2361, 6144]) InternLMBlockPipeLayer.14 cuda:1 tensor([712564646, 1668], device='cuda:0')
torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.18 cuda:0 tensor([757955384, 1668], device='cuda:1')
torch.Size([4, 2361, 6144]) InternLMBlockPipeLayer.15 cuda:1 tensor([712564646, 1668], device='cuda:0')
torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.19 cuda:0 tensor([757955384, 1668], device='cuda:1')
torch.Size([4, 2361, 6144]) InternLMBlockPipeLayer.16 cuda:1 tensor([712564646, 1668], device='cuda:0')
[W ProcessGroupNCCL.cpp:1856] Warning: 0NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (function operator())
tensor([757955384, 1668], device='cuda:1')
torch.Size([4, 2361, 6144]) InternLMBlockPipeLayer.17 cuda:1 tensor([757955384, 1668], device='cuda:1')
torch.Size([4, 2361, 6144]) InternLMBlockPipeLayer.18 cuda:1 tensor([757955384, 1668], device='cuda:1')
torch.Size([4, 2361, 6144]) InternLMBlockPipeLayer.19 cuda:1 tensor([757955384, 1668], device='cuda:1')
[W ProcessGroupNCCL.cpp:1856] Warning: 0NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (function operator())
torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.20 cuda:2 tensor([712564646, 1668], device='cuda:2')
torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.21 cuda:2 tensor([712564646, 1668], device='cuda:2')
torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.22 cuda:2 tensor([712564646, 1668], device='cuda:2')
torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.23 cuda:2 tensor([712564646, 1668], device='cuda:2')
torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.24 cuda:2 torch.Size([4, 2361, 6144]) InternLMBlockPipeLayer.20 cuda:3 tensor([757955384, 1668], device='cuda:3')
tensor([712564646, 1668], device='cuda:2')
torch.Size([4, 2361, 6144]) InternLMBlockPipeLayer.21 cuda:3 torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.25 cuda:2 tensor([757955384, 1668], device='cuda:3')
tensor([712564646, 1668], device='cuda:2')
torch.Size([4, 2361, 6144]) InternLMBlockPipeLayer.22 cuda:3 torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.26 cuda:2 tensor([757955384, 1668], device='cuda:3')
tensor([712564646, 1668], device='cuda:2')
torch.Size([4, 2361, 6144]) InternLMBlockPipeLayer.23 cuda:3 torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.27 cuda:2 tensor([712564646, 1668], device='cuda:2')
tensor([757955384, 1668], device='cuda:3')
torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.28 cuda:2 torch.Size([4, 2361, 6144]) InternLMBlockPipeLayer.24 cuda:3 tensor([712564646, 1668], device='cuda:2')
tensor([757955384, 1668], device='cuda:3')
torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.29 cuda:2 torch.Size([4, 2361, 6144]) InternLMBlockPipeLayer.25 cuda:3 tensor([712564646, 1668], device='cuda:2')
tensor([757955384, 1668], device='cuda:3')
torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.30 cuda:2 torch.Size([4, 2361, 6144]) InternLMBlockPipeLayer.26 cuda:3 tensor([712564646, 1668], device='cuda:2')
tensor([757955384, 1668], device='cuda:3')
torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.31 cuda:2 torch.Size([4, 2361, 6144]) InternLMBlockPipeLayer.27 cuda:3 tensor([712564646, 1668], device='cuda:2')
tensor([757955384, 1668], device='cuda:3')
torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.32 cuda:2 torch.Size([4, 2361, 6144]) InternLMBlockPipeLayer.28 cuda:3 tensor([712564646, 1668], device='cuda:2')
tensor([757955384, 1668], device='cuda:3')
torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.33 cuda:2 torch.Size([4, 2361, 6144]) InternLMBlockPipeLayer.29 cuda:3 tensor([712564646, 1668], device='cuda:2')
tensor([757955384, 1668], device='cuda:3')
torch.Size([4, 2361, 6144]) InternLMBlockPipeLayer.30 cuda:3 tensor([757955384, 1668], device='cuda:3')
torch.Size([4, 2361, 6144]) InternLMBlockPipeLayer.31 cuda:3 tensor([757955384, 1668], device='cuda:3')
torch.Size([4, 2361, 6144]) InternLMBlockPipeLayer.32 cuda:3 tensor([757955384, 1668], device='cuda:3')
torch.Size([4, 2361, 6144]) InternLMBlockPipeLayer.33 cuda:3 tensor([757955384, 1668], device='cuda:3')
torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.34 cuda:4 tensor([712564646, 1668], device='cuda:4')
torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.35 cuda:4 tensor([712564646, 1668], device='cuda:4')
torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.36 cuda:4 tensor([712564646, 1668], device='cuda:4')
torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.37 cuda:4 tensor([712564646, 1668], device='cuda:4')
torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.38 cuda:4 tensor([712564646, 1668], device='cuda:4')
torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.39 cuda:4 torch.Size([4, 2361, 6144]) InternLMBlockPipeLayer.34 cuda:5 tensor([757955384, 1668], device='cuda:5')
tensor([712564646, 1668], device='cuda:4')
torch.Size([4, 2361, 6144]) InternLMBlockPipeLayer.35 cuda:5 torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.40 cuda:4 tensor([757955384, 1668], device='cuda:5')
tensor([712564646, 1668], device='cuda:4')
torch.Size([4, 2361, 6144]) InternLMBlockPipeLayer.36 cuda:5 torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.41 cuda:4 tensor([757955384, 1668], device='cuda:5')
tensor([712564646, 1668], device='cuda:4')
torch.Size([4, 2361, 6144]) InternLMBlockPipeLayer.37 cuda:5 torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.42 cuda:4 tensor([757955384, 1668], device='cuda:5')
tensor([712564646, 1668], device='cuda:4')
torch.Size([4, 2361, 6144]) InternLMBlockPipeLayer.38 cuda:5 torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.43 cuda:4 tensor([757955384, 1668], device='cuda:5')
tensor([712564646, 1668], device='cuda:4')
torch.Size([4, 2361, 6144]) InternLMBlockPipeLayer.39 cuda:5 torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.44 cuda:4 tensor([757955384, 1668], device='cuda:5')
tensor([712564646, 1668], device='cuda:4')
torch.Size([4, 2361, 6144]) InternLMBlockPipeLayer.40 cuda:5 torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.45 cuda:4 tensor([757955384, 1668], device='cuda:5')
tensor([712564646, 1668], device='cuda:4')
torch.Size([4, 2361, 6144]) InternLMBlockPipeLayer.41 cuda:5 torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.46 cuda:4 tensor([757955384, 1668], device='cuda:5')
tensor([712564646, 1668], device='cuda:4')
torch.Size([4, 2361, 6144]) InternLMBlockPipeLayer.42 cuda:5 torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.47 cuda:4 tensor([757955384, 1668], device='cuda:5')
tensor([712564646, 1668], device='cuda:4')
torch.Size([4, 2361, 6144]) InternLMBlockPipeLayer.43 cuda:5 tensor([757955384, 1668], device='cuda:5')
torch.Size([4, 2361, 6144]) InternLMBlockPipeLayer.44 cuda:5 tensor([757955384, 1668], device='cuda:5')
torch.Size([4, 2361, 6144]) InternLMBlockPipeLayer.45 cuda:5 tensor([757955384, 1668], device='cuda:5')
torch.Size([4, 2361, 6144]) InternLMBlockPipeLayer.46 cuda:5 tensor([757955384, 1668], device='cuda:5')
torch.Size([4, 2361, 6144]) InternLMBlockPipeLayer.47 cuda:5 tensor([757955384, 1668], device='cuda:5')
torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.48 cuda:6 tensor([712564646, 1668], device='cuda:6')
torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.49 cuda:6 tensor([712564646, 1668], device='cuda:6')
torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.50 cuda:6 tensor([712564646, 1668], device='cuda:6')
torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.51 cuda:6 tensor([712564646, 1668], device='cuda:6')
torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.52 cuda:6 tensor([712564646, 1668], device='cuda:6')
torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.53 cuda:6 tensor([712564646, 1668], device='cuda:6')
torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.54 cuda:6 tensor([712564646, 1668], device='cuda:6')
torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.55 cuda:6 tensor([712564646, 1668], device='cuda:6')
torch.Size([4, 2361, 6144]) InternLMBlockPipeLayer.48 cuda:7 tensor([757955384, 1668], device='cuda:7')
torch.Size([4, 2361, 6144]) InternLMBlockPipeLayer.49 cuda:7 tensor([757955384, 1668], device='cuda:7')
torch.Size([4, 2361, 6144]) InternLMBlockPipeLayer.50 cuda:7 tensor([757955384, 1668], device='cuda:7')
torch.Size([4, 2361, 6144]) InternLMBlockPipeLayer.51 cuda:7 tensor([757955384, 1668], device='cuda:7')
torch.Size([4, 2361, 6144]) InternLMBlockPipeLayer.52 cuda:7 tensor([757955384, 1668], device='cuda:7')
torch.Size([4, 2361, 6144]) InternLMBlockPipeLayer.53 cuda:7 tensor([757955384, 1668], device='cuda:7')
torch.Size([4, 2361, 6144]) InternLMBlockPipeLayer.54 cuda:7 tensor([757955384, 1668], device='cuda:7')
torch.Size([4, 2361, 6144]) InternLMBlockPipeLayer.55 cuda:7 tensor([757955384, 1668], device='cuda:7')
[W ProcessGroupNCCL.cpp:1856] Warning: 0NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (function operator())
[W ProcessGroupNCCL.cpp:1856] Warning: 0NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (function operator())
[2024-06-21 16:41:08,780] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_allgather: 10.02 | optimizer_gradients: 32.98 | optimizer_step: 89.77
0 batch end
0 batch end
0 batch end
0 batch end
0 batch end
0 batch end
0 batch end
0 batch end
06/21/2024 16:41:11 - INFO - main - {'loss': 2.441850185394287, 'learning_rate': 0.0, 'epoch': 0.0}

Epoch 1: 0%| | 0/15698 [00:26<?, ?it/s, loss=2.44, learning_rate=0, epoch=0]

Epoch 1: 0%| | 1/15698 [00:26<115:52:37, 26.58s/it, loss=2.44, learning_rate=0, epoch=0]
dynamic ViT batch size: 31, images per sample: 7.75, dynamic token length: 2481
torch.Size([4, 2481, 6144]) InternLMBlockPipeLayer.0 cuda:1 tensor([766560813, 1374], device='cuda:1')
torch.Size([4, 2481, 6144]) InternLMBlockPipeLayer.1 cuda:1 tensor([766560813, 1374], device='cuda:1')
torch.Size([4, 2481, 6144]) InternLMBlockPipeLayer.2 cuda:1 tensor([766560813, 1374], device='cuda:1')
torch.Size([4, 2481, 6144]) InternLMBlockPipeLayer.3 cuda:1 tensor([766560813, 1374], device='cuda:1')
torch.Size([4, 2481, 6144]) InternLMBlockPipeLayer.4 cuda:1 tensor([766560813, 1374], device='cuda:1')
torch.Size([4, 2481, 6144]) InternLMBlockPipeLayer.5 cuda:1 tensor([766560813, 1374], device='cuda:1')
torch.Size([4, 2481, 6144]) InternLMBlockPipeLayer.6 cuda:1 tensor([766560813, 1374], device='cuda:1')
torch.Size([4, 2481, 6144]) InternLMBlockPipeLayer.7 cuda:1 tensor([766560813, 1374], device='cuda:1')
torch.Size([4, 2481, 6144]) InternLMBlockPipeLayer.8 cuda:1 tensor([766560813, 1374], device='cuda:1')
torch.Size([4, 2481, 6144]) InternLMBlockPipeLayer.9 cuda:1 tensor([766560813, 1374], device='cuda:1')
torch.Size([4, 2481, 6144]) InternLMBlockPipeLayer.10 cuda:1 tensor([766560813, 1374], device='cuda:1')
torch.Size([4, 2481, 6144]) InternLMBlockPipeLayer.11 cuda:1 tensor([766560813, 1374], device='cuda:1')
torch.Size([4, 2481, 6144]) InternLMBlockPipeLayer.12 cuda:1 tensor([766560813, 1374], device='cuda:1')
torch.Size([4, 2481, 6144]) InternLMBlockPipeLayer.13 cuda:1 tensor([766560813, 1374], device='cuda:1')
torch.Size([4, 2481, 6144]) InternLMBlockPipeLayer.14 cuda:1 tensor([766560813, 1374], device='cuda:1')
torch.Size([4, 2481, 6144]) InternLMBlockPipeLayer.15 cuda:1 tensor([766560813, 1374], device='cuda:1')
torch.Size([4, 2481, 6144]) InternLMBlockPipeLayer.16 cuda:1 tensor([766560813, 1374], device='cuda:1')
torch.Size([4, 2481, 6144]) InternLMBlockPipeLayer.17 cuda:1 tensor([766560813, 1374], device='cuda:1')
torch.Size([4, 2481, 6144]) InternLMBlockPipeLayer.18 cuda:1 tensor([766560813, 1374], device='cuda:1')
torch.Size([4, 2481, 6144]) InternLMBlockPipeLayer.19 cuda:1 tensor([766560813, 1374], device='cuda:1')
Traceback (most recent call last):
  File "/mnt/workspace/workgroup/chenghao/video_analysis/internvl_chat_interleaved/internvl/train/intern_vl_chat_finetune_block_pp.py", line 850, in <module>
    if __name__ == '__main__':
  File "/mnt/workspace/workgroup/chenghao/video_analysis/internvl_chat_interleaved/internvl/train/intern_vl_chat_finetune_block_pp.py", line 830, in main
    with torch.cuda.amp.autocast(dtype=torch.bfloat16, cache_enabled=False):
  File "/mnt/workspace/workgroup/miniconda/envs/internvl/lib/python3.9/site-packages/deepspeed/runtime/pipe/engine.py", line 373, in train_batch
    self._exec_schedule(sched)
  File "/mnt/workspace/workgroup/miniconda/envs/internvl/lib/python3.9/site-packages/deepspeed/runtime/pipe/engine.py", line 1373, in _exec_schedule
    self._exec_instr(**cmd.kwargs)
  File "/mnt/workspace/workgroup/miniconda/envs/internvl/lib/python3.9/site-packages/deepspeed/runtime/pipe/engine.py", line 789, in _exec_backward_pass
    torch.autograd.backward(tensors=out_tensors, grad_tensors=grad_tensors)
  File "/mnt/workspace/workgroup/miniconda/envs/internvl/lib/python3.9/site-packages/torch/autograd/__init__.py", line 244, in backward
    grad_tensors_ = _make_grads(tensors, grad_tensors, is_grads_batched=False)
  File "/mnt/workspace/workgroup/miniconda/envs/internvl/lib/python3.9/site-packages/torch/autograd/__init__.py", line 88, in _make_grads
    raise RuntimeError(
RuntimeError: Mismatch in shape: grad_output[0] has a shape of torch.Size([4, 2361, 6144]) and output[0] has a shape of torch.Size([4, 2481, 6144]).
dynamic ViT batch size: 32, images per sample: 8.0, dynamic token length: 2369
torch.Size([4, 2369, 6144]) InternLMBlockPipeLayer.0 cuda:0 tensor([790315515, 1374], device='cuda:0')
torch.Size([4, 2369, 6144]) InternLMBlockPipeLayer.1 cuda:0 tensor([790315515, 1374], device='cuda:0')
torch.Size([4, 2369, 6144]) InternLMBlockPipeLayer.2 cuda:0 tensor([790315515, 1374], device='cuda:0')
torch.Size([4, 2369, 6144]) InternLMBlockPipeLayer.3 cuda:0 tensor([790315515, 1374], device='cuda:0')
torch.Size([4, 2369, 6144]) InternLMBlockPipeLayer.4 cuda:0 tensor([790315515, 1374], device='cuda:0')
torch.Size([4, 2369, 6144]) InternLMBlockPipeLayer.5 cuda:0 tensor([790315515, 1374], device='cuda:0')
torch.Size([4, 2369, 6144]) InternLMBlockPipeLayer.6 cuda:0 tensor([790315515, 1374], device='cuda:0')
torch.Size([4, 2369, 6144]) InternLMBlockPipeLayer.7 cuda:0 tensor([790315515, 1374], device='cuda:0')
torch.Size([4, 2369, 6144]) InternLMBlockPipeLayer.8 cuda:0 tensor([790315515, 1374], device='cuda:0')
torch.Size([4, 2369, 6144]) InternLMBlockPipeLayer.9 cuda:0 tensor([790315515, 1374], device='cuda:0')
torch.Size([4, 2369, 6144]) InternLMBlockPipeLayer.10 cuda:0 tensor([790315515, 1374], device='cuda:0')
torch.Size([4, 2369, 6144]) InternLMBlockPipeLayer.11 cuda:0 tensor([790315515, 1374], device='cuda:0')
torch.Size([4, 2369, 6144]) InternLMBlockPipeLayer.12 cuda:0 tensor([790315515, 1374], device='cuda:0')
torch.Size([4, 2369, 6144]) InternLMBlockPipeLayer.13 cuda:0 tensor([790315515, 1374], device='cuda:0')
torch.Size([4, 2369, 6144]) InternLMBlockPipeLayer.14 cuda:0 tensor([790315515, 1374], device='cuda:0')
torch.Size([4, 2369, 6144]) InternLMBlockPipeLayer.15 cuda:0 tensor([790315515, 1374], device='cuda:0')
torch.Size([4, 2369, 6144]) InternLMBlockPipeLayer.16 cuda:0 tensor([790315515, 1374], device='cuda:0')
torch.Size([4, 2369, 6144]) InternLMBlockPipeLayer.17 cuda:0 tensor([790315515, 1374], device='cuda:0')
torch.Size([4, 2369, 6144]) InternLMBlockPipeLayer.18 cuda:0 tensor([790315515, 1374], device='cuda:0')
torch.Size([4, 2369, 6144]) InternLMBlockPipeLayer.19 cuda:0 tensor([790315515, 1374], device='cuda:0')

Epoch 1: 0%| | 1/15698 [00:33<146:05:44, 33.51s/it, loss=2.44, learning_rate=0, epoch=0]
Traceback (most recent call last):
  File "/mnt/workspace/workgroup/chenghao/video_analysis/internvl_chat_interleaved/internvl/train/intern_vl_chat_finetune_block_pp.py", line 850, in <module>
    if __name__ == '__main__':
  File "/mnt/workspace/workgroup/chenghao/video_analysis/internvl_chat_interleaved/internvl/train/intern_vl_chat_finetune_block_pp.py", line 830, in main
    with torch.cuda.amp.autocast(dtype=torch.bfloat16, cache_enabled=False):
  File "/mnt/workspace/workgroup/miniconda/envs/internvl/lib/python3.9/site-packages/deepspeed/runtime/pipe/engine.py", line 373, in train_batch
    self._exec_schedule(sched)
  File "/mnt/workspace/workgroup/miniconda/envs/internvl/lib/python3.9/site-packages/deepspeed/runtime/pipe/engine.py", line 1373, in _exec_schedule
    self._exec_instr(**cmd.kwargs)
  File "/mnt/workspace/workgroup/miniconda/envs/internvl/lib/python3.9/site-packages/deepspeed/runtime/pipe/engine.py", line 789, in _exec_backward_pass
    torch.autograd.backward(tensors=out_tensors, grad_tensors=grad_tensors)
  File "/mnt/workspace/workgroup/miniconda/envs/internvl/lib/python3.9/site-packages/torch/autograd/__init__.py", line 244, in backward
    grad_tensors_ = _make_grads(tensors, grad_tensors, is_grads_batched=False)
  File "/mnt/workspace/workgroup/miniconda/envs/internvl/lib/python3.9/site-packages/torch/autograd/__init__.py", line 88, in _make_grads
    raise RuntimeError(
RuntimeError: Mismatch in shape: grad_output[0] has a shape of torch.Size([4, 2262, 6144]) and output[0] has a shape of torch.Size([4, 2369, 6144]).
torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.20 cuda:2 tensor([790315515, 1374], device='cuda:2')
torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.21 cuda:2 tensor([790315515, 1374], device='cuda:2')
torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.22 cuda:2 tensor([790315515, 1374], device='cuda:2')
torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.23 cuda:2 tensor([790315515, 1374], device='cuda:2')
torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.24 cuda:2 tensor([790315515, 1374], device='cuda:2')
torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.25 cuda:2 tensor([790315515, 1374], device='cuda:2')
torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.26 cuda:2 tensor([790315515, 1374], device='cuda:2')
torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.27 cuda:2 tensor([790315515, 1374], device='cuda:2')
torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.28 cuda:2 tensor([790315515, 1374], device='cuda:2')
torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.29 cuda:2 tensor([790315515, 1374], device='cuda:2')
torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.30 cuda:2 tensor([790315515, 1374], device='cuda:2')
torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.31 cuda:2 tensor([790315515, 1374], device='cuda:2')
torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.32 cuda:2 tensor([790315515, 1374], device='cuda:2')

Epoch: 0it [02:15, ?it/s]
torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.33 cuda:2 tensor([790315515, 1374], device='cuda:2')
torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.34 cuda:4 tensor([790315515, 1374], device='cuda:4')
torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.35 cuda:4 tensor([790315515, 1374], device='cuda:4')
torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.36 cuda:4 tensor([790315515, 1374], device='cuda:4')
torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.37 cuda:4 tensor([790315515, 1374], device='cuda:4')
torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.38 cuda:4 tensor([790315515, 1374], device='cuda:4')
torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.39 cuda:4 tensor([790315515, 1374], device='cuda:4')
torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.40 cuda:4 tensor([790315515, 1374], device='cuda:4')
torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.41 cuda:4 tensor([790315515, 1374], device='cuda:4')
torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.42 cuda:4 tensor([790315515, 1374], device='cuda:4')
torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.43 cuda:4 tensor([790315515, 1374], device='cuda:4')
torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.44 cuda:4 tensor([790315515, 1374], device='cuda:4')
torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.45 cuda:4 tensor([790315515, 1374], device='cuda:4')
torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.46 cuda:4 tensor([790315515, 1374], device='cuda:4')
torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.47 cuda:4 tensor([790315515, 1374], device='cuda:4')
torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.48 cuda:6 tensor([790315515, 1374], device='cuda:6')
torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.49 cuda:6 tensor([790315515, 1374], device='cuda:6')
torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.50 cuda:6 tensor([790315515, 1374], device='cuda:6')
torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.51 cuda:6 tensor([790315515, 1374], device='cuda:6')
torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.52 cuda:6 tensor([790315515, 1374], device='cuda:6')
torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.53 cuda:6 tensor([790315515, 1374], device='cuda:6')
torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.54 cuda:6 tensor([790315515, 1374], device='cuda:6')
torch.Size([4, 2262, 6144]) InternLMBlockPipeLayer.55 cuda:6 tensor([790315515, 1374], device='cuda:6')
[2024-06-21 16:41:30,578] [INFO] [launch.py:316:sigkill_handler] Killing subprocess 225308
[2024-06-21 16:41:30,578] [INFO] [launch.py:316:sigkill_handler] Killing subprocess 225309
[2024-06-21 16:41:31,133] [INFO] [launch.py:316:sigkill_handler] Killing subprocess 225310

One pipeline partition of the model:
[2024-06-21 16:38:26,030] [INFO] [module.py:375:_partition_layers] Partitioning pipeline stages with method parameters
stage=0 layers=21
0: TokenizerPipeLayer
1: InternLMBlockPipeLayer
2: InternLMBlockPipeLayer
3: InternLMBlockPipeLayer
4: InternLMBlockPipeLayer
5: InternLMBlockPipeLayer
6: InternLMBlockPipeLayer
7: InternLMBlockPipeLayer
8: InternLMBlockPipeLayer
9: InternLMBlockPipeLayer
10: InternLMBlockPipeLayer
11: InternLMBlockPipeLayer
12: InternLMBlockPipeLayer
13: InternLMBlockPipeLayer
14: InternLMBlockPipeLayer
15: InternLMBlockPipeLayer
16: InternLMBlockPipeLayer
17: InternLMBlockPipeLayer
18: InternLMBlockPipeLayer
19: InternLMBlockPipeLayer
20: InternLMBlockPipeLayer
stage=1 layers=14
21: InternLMBlockPipeLayer
22: InternLMBlockPipeLayer
23: InternLMBlockPipeLayer
24: InternLMBlockPipeLayer
25: InternLMBlockPipeLayer
26: InternLMBlockPipeLayer
27: InternLMBlockPipeLayer
28: InternLMBlockPipeLayer
29: InternLMBlockPipeLayer
30: InternLMBlockPipeLayer
31: InternLMBlockPipeLayer
32: InternLMBlockPipeLayer
33: InternLMBlockPipeLayer
34: InternLMBlockPipeLayer
stage=2 layers=14
35: InternLMBlockPipeLayer
36: InternLMBlockPipeLayer
37: InternLMBlockPipeLayer
38: InternLMBlockPipeLayer
39: InternLMBlockPipeLayer
40: InternLMBlockPipeLayer
41: InternLMBlockPipeLayer
42: InternLMBlockPipeLayer
43: InternLMBlockPipeLayer
44: InternLMBlockPipeLayer
45: InternLMBlockPipeLayer
46: InternLMBlockPipeLayer
47: InternLMBlockPipeLayer
48: InternLMBlockPipeLayer
stage=3 layers=11
49: InternLMBlockPipeLayer
50: InternLMBlockPipeLayer
51: InternLMBlockPipeLayer
52: InternLMBlockPipeLayer
53: InternLMBlockPipeLayer
54: InternLMBlockPipeLayer
55: InternLMBlockPipeLayer
56: InternLMBlockPipeLayer
57: FLNPipeLayer
58: LMPipeLayer
59: LossPipeLayer

Notice that the two sequences of length 2369 and 2481 seem to get blocked after stage 0 (layer 19), and the matching between grad_tensor and output becomes scrambled...

Your partition is probably 1357 / 0246. The out tensor tends to be the longer one (i.e., the batch's max_length), while the grad has the correct length. I'd suggest checking your collator and each block's inputs and outputs: make sure they follow the protocol of DeepSpeed's PipelineModule and transfer everything as tensors where possible. Also note that the pipeline model takes a labels item as input (in tuple form) — check that too.
For example, the collator should return something roughly in the shape of this function:

def collate_fn_minigpt4qwen(batch, preprocess_func):
    image_list, conversation_list = [], []

    for sample in batch:
        image_list.append(sample["image"])
        conversation_list.append(sample["conversations"])

    new_batch = {
        "image": torch.stack(image_list, dim=0),
        "conversations": conversation_list,
    }
    data_dict = preprocess_func(new_batch['conversations'])

    # Tuple[Tuple[Tensor], Tensor] in my case
    return ((new_batch['image'], data_dict['input_ids'], data_dict['labels'], data_dict['attention_mask']),
            data_dict['labels'])

Some pitfalls I've stepped on are covered in this blog post of mine: https://zhuanlan.zhihu.com/p/684462477

Right — the stage partition is 1357 / 0246, and the collator returns Tuple[Tuple[torch.Tensor], Any] exactly as described in your blog. After more digging and testing, I found that for the failing batches, the activation gets corrupted when transferred from GPU0 (GPU1) to GPU2 (GPU3): it takes on the sequence shape of this rank's previous step's batch. An example:
On rank1 (GPUs 0/2/4/6), the first batch's input_embeds has shape (4, 2262, 6144) and the second batch's has shape (4, 2361, 6144). The first batch's forward and backward are fine. The second batch is also fine on GPU0 — every layer's hidden_states there is (4, 2361, 6144) — but once it reaches the layers on GPU2, the shape becomes (4, 2262, 6144), and then it errors with:
RuntimeError: Mismatch in shape: grad_output[0] has a shape of torch.Size([4, 2262, 6144]) and output[0] has a shape of torch.Size([4, 2361, 6144]).
That's what actually happens, and I'm puzzled why. I've varied the pipeline degree (pp=2/4/8) and it's always the same: the first step is fine, and it breaks from the second step on. Could you help analyze why? Thanks a lot!!! (If it takes too much of your time, never mind!)

The first training step is fine; the problem starts at the second step. The second step's batch only propagates correctly on the stage-0 GPU; when it moves from stage 0 to the stage-1 GPU, hidden_states takes on the shape from step 1. Could you tell me your deepspeed and torch versions? I can't pin down the real root cause of this.

Could I see your block code? Or DM me your contact info on Zhihu and we can look at it together when we have time?

Sure, I'll DM you on Zhihu! As for the code — it's on my work laptop, so I can't copy-paste it directly 😭

Solved.

We found that DeepSpeed pipeline parallelism requires the same seq_length across a whole mini-batch (including all of its micro-batches) and a constant batch size (so the dataloader's `drop_last` should be set to True). This also fits the symptom above: the pipeline engine appears to exchange tensor-shape metadata and size its point-to-point buffers only on the first send and then reuse them, so any later activation with a different sequence length arrives in the stale shape.
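The fix can be sketched independently of any framework: pad every sequence to one global maximum length in the collator, and drop a trailing incomplete batch so the batch dimension stays constant too. The constants and toy token sequences below are illustrative assumptions, not values from the training code:

```python
# Toy sketch: make every batch the pipeline engine sees identical in shape.
MAX_LEN = 8   # global maximum sequence length (fixed across ALL steps)
PAD_ID = 0    # padding token id
BATCH = 4

def pad_to_max(seq, max_len=MAX_LEN, pad_id=PAD_ID):
    """Right-pad one token-id sequence to the global maximum length."""
    return seq + [pad_id] * (max_len - len(seq))

def collate(batch):
    """Pad every sample to MAX_LEN so all micro-batches share one seq_length."""
    return [pad_to_max(s) for s in batch]

def batches(dataset, batch_size=BATCH, drop_last=True):
    """Yield fixed-size batches; drop_last discards a trailing short batch."""
    for i in range(0, len(dataset), batch_size):
        chunk = dataset[i:i + batch_size]
        if drop_last and len(chunk) < batch_size:
            break  # a smaller final batch would change the pipeline's shapes
        yield collate(chunk)

# 10 variable-length "sequences": batch size and seq length now stay constant.
data = [list(range(1, n + 1)) for n in (3, 5, 8, 2, 7, 4, 6, 1, 8, 5)]
shapes = [(len(b), len(b[0])) for b in batches(data)]
print(shapes)  # every batch is (4, 8)
```

In a real training script this corresponds to padding inside `collate_fn` and passing `drop_last=True` to the `DataLoader`.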

This is a good find. I'll close this issue but keep it pinned.