declare-lab/tango

NaN loss in training

tranquangchung opened this issue · 5 comments

Hi
Thanks for sharing your project.
However, when I trained your model with your config, both the validation and training losses were NaN.
I tried many times, but the result was always the same.
Can you tell me the cause and how to fix it?

The NaN comes from the language model. I solved the problem by modifying your code, and training now works very well.

Hi @tranquangchung, how did you solve the NaN problem? Thank you!

Hi, could you please explain how you solved this problem? Thanks!

It turns out the problem is with google/flan-t5-large: this model does not support fp16 training. Use fp32 instead and it works fine.
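For anyone wondering why half precision can turn a loss into NaN, here is a minimal sketch of one common mechanism, overflow. It is not taken from the tango code, and I am assuming (not confirming) that this is what happens inside the text encoder. The helpers `to_fp16` and `toy_loss` are hypothetical names made up for this illustration; the sketch uses only the Python standard library and emulates fp16 with `struct`'s half-precision format.

```python
import math
import struct

FP16_MAX = 65504.0  # largest finite IEEE 754 half-precision value


def to_fp16(x: float) -> float:
    """Round x to the nearest fp16 value; overflow becomes +/-inf (hypothetical helper)."""
    if math.isinf(x) or math.isnan(x):
        return x
    try:
        # "e" is the IEEE 754 binary16 format; pack/unpack performs the rounding.
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        # struct refuses values beyond the fp16 range; real fp16 hardware yields inf.
        return math.copysign(math.inf, x)


def toy_loss(activation: float, half_precision: bool) -> float:
    """Toy stand-in for a loss: the difference of two large intermediate terms."""
    cast = to_fp16 if half_precision else (lambda v: v)
    a = cast(activation * activation)  # 300 * 300 = 90000 exceeds FP16_MAX
    b = cast(activation * activation)
    return cast(a - b)  # inf - inf is NaN in half precision, 0.0 otherwise


if __name__ == "__main__":
    print("fp16-like:", toy_loss(300.0, half_precision=True))
    print("full precision:", toy_loss(300.0, half_precision=False))
```

Once a single intermediate value overflows to inf, later arithmetic can produce NaN, and that NaN then propagates into both the training and validation loss. Running without fp16 mixed precision keeps such values finite, which is consistent with the fp32 fix described above.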

Glad to know that it was solved. FYI we have released Tango 2: https://arxiv.org/abs/2404.09956