[BUG] Triton kernel: loss 0, grad_norm NaN

Question

mdy666 opened this issue 12 days ago · 0 comments

Describe the bug
I replaced the naive kernel in transformers with my own Triton kernel. After a few steps the loss becomes 0 and grad_norm becomes NaN.

Version: deepspeed 15.4

I use ZeRO-2 to train Qwen2.5 7B. Training Qwen2.5 0.5B without ZeRO works fine.
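
To pin down the exact step where things break, a small helper like the one below could scan the logged metrics. This is only a sketch: `first_bad_step`, the `history` record layout, and the numbers are hypothetical and made up for illustration. They are not logs from this run and not part of DeepSpeed or transformers.

```python
import math


def first_bad_step(history):
    """Return (step, reason) for the first record that looks broken, else None.

    `history` is a hypothetical list of dicts with the keys
    "step", "loss" and "grad_norm".
    """
    for record in history:
        loss = record["loss"]
        grad_norm = record["grad_norm"]
        # Check grad_norm first: a NaN/inf norm is the stronger signal.
        if not math.isfinite(grad_norm):
            return record["step"], "non-finite grad_norm"
        if not math.isfinite(loss):
            return record["step"], "non-finite loss"
        if loss == 0.0:
            return record["step"], "loss is exactly 0"
    return None


# Made-up values for illustration only.
history = [
    {"step": 1, "loss": 2.31, "grad_norm": 1.7},
    {"step": 2, "loss": 2.05, "grad_norm": 1.4},
    {"step": 3, "loss": 0.0, "grad_norm": float("nan")},
]
print(first_bad_step(history))
```

Knowing the first bad step would make it possible to rerun just that batch with the naive kernel and with the Triton kernel and compare the two.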