QwenLM/Qwen2.5-Coder

[Train bug] Gradient explosion in the SFT training stage with DeepSpeed ZeRO-2


Gradient explosion

I fine-tuned with a self-built FIM SFT dataset and encountered abnormal loss when training with DeepSpeed ZeRO-2. The same dataset did not have this issue on CodeQwen1.5, and after switching to ZeRO-3 the training proceeded normally. Is this a problem with the model architecture, or an incompatibility with the DeepSpeed version?
BTW, my DeepSpeed version is 0.13.2.
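
For reference, the ZeRO-2 config I launch with is shaped roughly like the sketch below; the specific values (gradient clipping, bf16, bucket size) are illustrative placeholders rather than my exact settings:

```python
# Minimal ZeRO-2 DeepSpeed config sketch (illustrative values, not the exact run config).
# "auto" fields are resolved by the Hugging Face Trainer integration.
ds_zero2_config = {
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
    "gradient_clipping": 1.0,            # worth double-checking when gradients explode
    "bf16": {"enabled": "auto"},
    "zero_optimization": {
        "stage": 2,
        "overlap_comm": True,
        "contiguous_gradients": True,
        "reduce_bucket_size": 5e8,
    },
}
```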

cyente commented

Here are our SFT best practices; you can refer to them to verify whether there are any configuration errors.

https://github.com/QwenLM/Qwen2.5-Coder/tree/main/sft

Note that we updated the special tokens when moving from CodeQwen1.5 to Qwen2.5-Coder. Please confirm that nothing related to the special tokens goes wrong during training; a quick tokenizer check is sketched after the token list below.

{
  "<|fim_prefix|>": 151659, 
  "<|fim_middle|>": 151660, 
  "<|fim_suffix|>": 151661, 
  "<|fim_pad|>": 151662, 
  "<|repo_name|>": 151663, 
  "<|file_sep|>": 151664, 
  "<|im_start|>": 151644, 
  "<|im_end|>": 151645
}
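
If it helps, a sanity check along these lines will confirm that your tokenizer maps the new special tokens to the IDs listed above (the model name here is only an example):

```python
from transformers import AutoTokenizer

# Example model name; substitute the checkpoint you are fine-tuning.
tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-Coder-7B")

# Print the ID each special token resolves to and compare with the table above.
for token in ["<|fim_prefix|>", "<|fim_middle|>", "<|fim_suffix|>",
              "<|fim_pad|>", "<|repo_name|>", "<|file_sep|>",
              "<|im_start|>", "<|im_end|>"]:
    print(token, tok.convert_tokens_to_ids(token))
```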

Big shout-out to your team. I did check the new special-token format, but I still hit the same problem with ZeRO-2. BTW, is there any plan to provide unsupervised-training examples?

A smaller LR may work.
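
For instance, something along these lines; the values are illustrative, not a recommended recipe:

```python
from transformers import TrainingArguments

# Illustrative hyperparameters only; tune for your own setup.
args = TrainingArguments(
    output_dir="out",
    learning_rate=1e-5,     # smaller than a typical 2e-5 / 5e-5 starting point
    warmup_ratio=0.03,      # warmup also helps damp early loss spikes
    max_grad_norm=1.0,      # gradient clipping as an extra guard
    bf16=True,
)
```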

Could you try reproducing the issue with the current SFT script? If the problem persists, please provide more detailed, reproducible content so we can help investigate further.

Okay, I will provide further information once the script adaptation is completed.