mdy666 opened this issue 12 days ago · 0 comments
Describe the bug
I replaced the naive kernel in transformers with my own Triton kernel. After a few training steps, the loss becomes 0 and grad_norm becomes NaN.
version: deepspeed: 15.4,
I use ZeRO-2 to train Qwen2.5 7B. Training works fine when I train Qwen2.5 0.5B without ZeRO.
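
One way to narrow this down, independent of DeepSpeed, is to check the custom kernel against the reference implementation on the same input, for both the forward output and the input gradient. The following is only a minimal sketch in plain PyTorch: `compare_kernels`, `custom_fn`, and `reference_fn` are hypothetical names, not part of transformers, Triton, or DeepSpeed, and the tolerances are arbitrary placeholders that would need loosening for bf16/fp16 inputs.

```python
import torch


def compare_kernels(custom_fn, reference_fn, x, atol=1e-2, rtol=1e-2):
    """Compare a custom op against a reference op on the same input.

    Hypothetical helper: both callables take one tensor and return one tensor.
    Returns a dict of booleans describing finiteness and agreement.
    """
    # Independent leaf copies so each function gets its own gradient.
    x_c = x.detach().clone().requires_grad_(True)
    x_r = x.detach().clone().requires_grad_(True)

    out_c = custom_fn(x_c)
    out_r = reference_fn(x_r)

    # Use a simple scalar loss to drive the backward pass of both ops.
    out_c.sum().backward()
    out_r.sum().backward()

    finite = bool(torch.isfinite(out_c).all()) and bool(torch.isfinite(x_c.grad).all())
    return {
        "finite": finite,
        "forward_match": torch.allclose(out_c, out_r, atol=atol, rtol=rtol),
        "backward_match": torch.allclose(x_c.grad, x_r.grad, atol=atol, rtol=rtol),
    }


if __name__ == "__main__":
    x = torch.randn(4, 8)
    # Stand-ins for the real kernels: SiLU written two ways.
    result = compare_kernels(
        torch.nn.functional.silu,
        lambda t: t * torch.sigmoid(t),
        x,
    )
    print(result)
```

Running the same check with the dtype and tensor shapes used in the 7B run (rather than the 0.5B run) may show whether the kernel itself produces non-finite values, or whether the problem only appears once ZeRO-2 is involved.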