NaN loss in training
tranquangchung opened this issue · 5 comments
tranquangchung commented
Hi
Thanks for sharing your project.
However, when I trained your model using your config, both the training and validation losses were NaN.
I tried many times, but the result was always the same.
Can you tell me the cause and how to fix it?
tranquangchung commented
The NaN comes from the language model. I solved the problem by modifying your code, and it now works very well.
Sreyan88 commented
Hi @tranquangchung, how did you solve the NaN problem? Thank you!
BingliangLi commented
Hi, could you please explain how you solved this problem? Thx!
BingliangLi commented
It turns out the problem is with google/flan-t5-large: this model does not support fp16 training. Use fp32 and it works fine.
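For anyone wondering why fp16 can produce NaN at all, here is a minimal sketch of the general mechanism. It is not code from this repository and does not load google/flan-t5-large. It emulates fp16 and fp32 rounding in pure Python, and the helper names (`cast`, `square_and_center`) and the activation values are made up for illustration. The assumption is that some intermediate value exceeds the fp16 maximum of 65504, becomes infinity, and a later subtraction of two infinities yields NaN. Where exactly this happens inside the model is not established in this thread.

```python
import math
import struct


def cast(x: float, fmt: str) -> float:
    """Round x to the IEEE 754 format given by a struct code ("e" = fp16, "f" = fp32).

    Out-of-range values become +/-inf, as they would in floating-point arithmetic.
    """
    if math.isnan(x) or math.isinf(x):
        return x
    try:
        return struct.unpack("<" + fmt, struct.pack("<" + fmt, x))[0]
    except OverflowError:
        # struct refuses to pack out-of-range values; emulate overflow to infinity.
        return math.copysign(math.inf, x)


def square_and_center(values, fmt):
    """Square each value, then subtract the largest square, rounding after every step."""
    squares = [cast(cast(v, fmt) * cast(v, fmt), fmt) for v in values]
    peak = max(squares)
    return [cast(s - peak, fmt) for s in squares]


# Hypothetical activations, chosen so that 300 * 300 = 90000 exceeds the fp16 maximum (65504).
activations = [300.0, 1.0]

fp16_result = square_and_center(activations, "e")
fp32_result = square_and_center(activations, "f")

print("fp16:", fp16_result)
print("fp32:", fp32_result)
```

In the fp16 run the square overflows to infinity, so the centering step computes inf - inf and the first entry is NaN. In the fp32 run every value stays finite. If the training script is launched through Hugging Face Accelerate (an assumption here), fp32 is usually selected by running `accelerate launch --mixed_precision no ...` or by choosing "no" for mixed precision in `accelerate config`.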
soujanyaporia commented
Glad to know that it was solved. FYI we have released Tango 2: https://arxiv.org/abs/2404.09956