cylnlp/dialogsum

T5 model can't be trained with bart_train.py when fp16=True

huangfu170 opened this issue · 1 comment

Thank you for your excellent contribution to NLP. I recently tried to use bart_train.py to train a T5-large model with fp16=True on an NVIDIA V100 GPU, but it failed: the output loss was NaN. The problem went away when I turned fp16 off. Is this a Hugging Face or PyTorch bug? I found reports that it was fixed in 2021, but the fix does not seem to work with your awesome dataset.
Anyway, your dataset helps me a lot.

cylnlp commented

It is a well-known problem that T5 models cannot be trained in fp16: T5 was pretrained in bfloat16, and its activations overflow the narrower fp16 range, which is what produces the NaN loss.
You can use fp32 instead.
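For reference, here is a minimal sketch of running T5 fine-tuning in full fp32 with the Hugging Face Trainer API. This is not the repo's actual bart_train.py; the model name, output path, and hyperparameters below are illustrative assumptions.

```python
# Minimal sketch: fine-tune T5 in fp32 to avoid the fp16 NaN-loss issue.
# All names and hyperparameters here are illustrative, not from bart_train.py.
from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    Seq2SeqTrainingArguments,
)

model_name = "t5-large"  # assumption: the T5 checkpoint being fine-tuned
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

training_args = Seq2SeqTrainingArguments(
    output_dir="./t5-dialogsum",  # hypothetical output path
    fp16=False,                   # keep T5 in fp32 so the loss stays finite
    # bf16=True,                  # bf16 also avoids the overflow, but it
    #                             # requires Ampere+ GPUs; V100 lacks bf16
    per_device_train_batch_size=4,
    num_train_epochs=3,
)
```

On newer hardware (A100 and later), setting bf16=True instead of falling back to fp32 recovers the memory and speed benefits of mixed precision without the overflow, since bf16 has the same exponent range T5 was pretrained with.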

I think bart_train.py targets an older PyTorch version.
If you want to adapt it to a newer version, some modifications are needed.