QwenLM/Qwen2.5-Coder

[Train bug] Gradient explosion in the SFT training stage with DeepSpeed ZeRO-2


Gradient explosion

I fine-tuned with a self-built FIM SFT dataset and encountered abnormal loss when training with DeepSpeed ZeRO-2. The same dataset did not have this issue on CodeQwen1.5, and after switching to ZeRO-3 the training proceeded normally. Is this a problem with the model architecture, or an incompatibility with the DeepSpeed version?
BTW, my DeepSpeed version is 0.13.2.
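
For reference, the ZeRO-2 config I launch with is shaped roughly like the sketch below; the specific values (gradient clipping, bf16, bucket size) are illustrative placeholders rather than my exact settings:

```python
# Minimal ZeRO-2 DeepSpeed config sketch (illustrative values, not the exact run config).
# "auto" fields are resolved by the Hugging Face Trainer integration.
ds_zero2_config = {
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
    "gradient_clipping": 1.0,            # worth double-checking when gradients explode
    "bf16": {"enabled": "auto"},
    "zero_optimization": {
        "stage": 2,
        "overlap_comm": True,
        "contiguous_gradients": True,
        "reduce_bucket_size": 5e8,
    },
}
```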

cyente commented

Here are our SFT best practices; you can refer to them to verify whether there are any configuration errors.

https://github.com/QwenLM/Qwen2.5-Coder/tree/main/sft

Note that we updated the special tokens when moving from CodeQwen1.5 to Qwen2.5-Coder. Please confirm that nothing related to the special tokens goes wrong during training; a quick tokenizer check is sketched after the token list below.

{
  "<|fim_prefix|>": 151659, 
  "<|fim_middle|>": 151660, 
  "<|fim_suffix|>": 151661, 
  "<|fim_pad|>": 151662, 
  "<|repo_name|>": 151663, 
  "<|file_sep|>": 151664, 
  "<|im_start|>": 151644, 
  "<|im_end|>": 151645
}
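
If it helps, a sanity check along these lines will confirm that your tokenizer maps the new special tokens to the IDs listed above (the model name here is only an example):

```python
from transformers import AutoTokenizer

# Example model name; substitute the checkpoint you are fine-tuning.
tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-Coder-7B")

# Print the ID each special token resolves to and compare with the table above.
for token in ["<|fim_prefix|>", "<|fim_middle|>", "<|fim_suffix|>",
              "<|fim_pad|>", "<|repo_name|>", "<|file_sep|>",
              "<|im_start|>", "<|im_end|>"]:
    print(token, tok.convert_tokens_to_ids(token))
```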

Big shout-out to your team. I did check the new special-token format, but I still hit the same problem with ZeRO-2. BTW, is there any plan to provide unsupervised-training examples?

A smaller LR may work.
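
For instance, something along these lines; the values are illustrative, not a recommended recipe:

```python
from transformers import TrainingArguments

# Illustrative hyperparameters only; tune for your own setup.
args = TrainingArguments(
    output_dir="out",
    learning_rate=1e-5,     # smaller than a typical 2e-5 / 5e-5 starting point
    warmup_ratio=0.03,      # warmup also helps damp early loss spikes
    max_grad_norm=1.0,      # gradient clipping as an extra guard
    bf16=True,
)
```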

Could you try reproducing the issue with the current SFT script? If the problem persists, please provide more detailed, reproducible content so we can help investigate further.

Okay, I will provide further information once the script adaptation is completed.