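[Question]: When fine-tuning Qwen2 (used as the LLM of a multimodal model), the forward pass takes abnormally long every few steps (roughly 4-10x a normal step), which slows down overall training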

CSammyfd opened this issue 2 months ago · 0 comments

Has this been raised before?

I have checked the GitHub README.
I have checked the Qwen documentation and cannot find an answer there.
I have searched the issues and there is not a similar one.
I confirm that this is not a bug report, a feature request, or a badcase.

Description

Background: this happens when training a multimodal model with InternVL as the codebase, and it slows down pretraining.

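Some details and observations:
1) It only happens during pretraining; step time is stable during SFT (which is why I can't make sense of it). I have made sure the training parameters are basically the same for both (note: only the adapter is trainable).
2) From what I can see, it does not appear to be related to the sequence length of the current sample.
3) Fine-grained timing printouts show that when a slow step occurs, it comes mainly from the decoder layers: the per-layer decoderlayer times follow a fast -> slow -> extremely slow (1000x the normal time) -> slow pattern, and the abnormal time is contributed mainly by the extremely slow layer.
4) The time is spent almost entirely in self.attn, and this holds for both sdpa and flashattn.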
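
For reference, the per-layer timing in 3) could be collected with something like the sketch below. It is not the code actually used in this issue: `LayerTimer`, `measure`, `outliers`, and the `sync` parameter are hypothetical names made up for illustration. One thing that may be worth ruling out: CUDA kernels are launched asynchronously, so a host-side timer without a synchronization point can charge queued GPU work to whichever layer happens to block first. The sketch therefore accepts an optional `sync` callable that is invoked before and after each measured region.

```python
import time
from collections import defaultdict
from contextlib import contextmanager
from statistics import median


class LayerTimer:
    """Hypothetical helper: wall-clock timings per named region."""

    def __init__(self, sync=None):
        # `sync` is an optional zero-argument callable that blocks until
        # all queued device work has finished (assumption: on a CUDA
        # setup this would be torch.cuda.synchronize). Defaults to a no-op.
        self.sync = sync if sync is not None else (lambda: None)
        self.records = defaultdict(list)

    @contextmanager
    def measure(self, name):
        # Drain pending work first so earlier regions are not charged here.
        self.sync()
        start = time.perf_counter()
        try:
            yield
        finally:
            # Wait for this region's own work before stopping the clock.
            self.sync()
            self.records[name].append(time.perf_counter() - start)

    def outliers(self, name, factor=4.0):
        # Return (index, seconds) for samples slower than factor * median.
        times = self.records.get(name, [])
        if not times:
            return []
        baseline = median(times)
        return [(i, t) for i, t in enumerate(times) if t > factor * baseline]
```

With PyTorch on CUDA, one would presumably construct it as `LayerTimer(sync=torch.cuda.synchronize)` and wrap each decoder layer call in `with timer.measure(f"layer_{i}"):`. Synchronizing at every layer slows training, so this is only meant for a short debugging run.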

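At the moment I don't know what causes this or how to continue troubleshooting or fixing it. If anyone familiar with this could share ideas or the likely cause, I'd appreciate it. Thanks~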