microsoft/DeepSpeed

[BUG] Triton kernel: loss 0, grad_norm nan

mdy666 opened this issue · 0 comments

Describe the bug
I replaced the naive kernel in transformers with my own Triton kernel. After a few steps, the loss becomes 0 and grad_norm becomes nan.

Version: deepspeed 15.4

I use ZeRO-2 to train Qwen2.5 7B. Training Qwen2.5 0.5B without ZeRO works fine.