Audio-WestlakeU/NBSS

Loss NaN Value

PriyankaPaud opened this issue · 5 comments

The loss value comes out as NaN.

I also get a CUDA error while training.

I didn't encounter this problem. Did you use 16-bit precision training?

Is it normal to run into NaN when training with fp16?

quancs commented

Yes, that's normal. You can take a checkpoint from an earlier epoch and continue training with 32-bit precision.
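A minimal sketch of that resume step, assuming the training loop is built on PyTorch Lightning (imported here as `pytorch_lightning`). The function name `resume_in_fp32` and its arguments are hypothetical and not part of this repository; adapt them to your own training script.

```python
def resume_in_fp32(model, datamodule, ckpt_path: str, max_epochs: int):
    """Resume training from an earlier checkpoint with full 32-bit precision.

    Hypothetical helper: model, datamodule and ckpt_path come from your own setup.
    """
    # Imported lazily so the sketch can be read without Lightning installed.
    import pytorch_lightning as pl

    # precision=32 turns off fp16 mixed precision for the resumed run.
    trainer = pl.Trainer(precision=32, max_epochs=max_epochs)
    # ckpt_path makes fit() restore the saved training state before continuing.
    trainer.fit(model, datamodule=datamodule, ckpt_path=ckpt_path)
    return trainer
```

Pick a checkpoint saved before the first NaN appeared, since later ones may already contain corrupted weights.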

quancs commented

@xxchauncey You can use bf16. Its performance is slightly worse than fp16, but it rarely runs into NaN.
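As a rough illustration of the trade-off between the two formats, the sketch below rounds Python floats to fp16 and to an approximation of bf16 using only the standard library. The helpers `to_fp16` and `to_bf16` are hypothetical, and the bf16 conversion truncates instead of rounding, so treat it as an approximation and not as what the hardware does.

```python
import math
import struct


def to_fp16(x: float) -> float:
    """Round x to IEEE half precision; out-of-range values become +/-inf."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        # struct refuses to pack values beyond the fp16 range (max 65504);
        # floating-point hardware would produce infinity instead.
        return math.copysign(math.inf, x)


def to_bf16(x: float) -> float:
    """Approximate bf16 by keeping the top 16 bits of the float32 encoding."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]


big = 70000.0                      # a large intermediate value
fp16_big = to_fp16(big)            # overflows to inf
fp16_diff = fp16_big - fp16_big    # inf - inf gives nan
bf16_big = to_bf16(big)            # stays finite, but is coarser

small = 1.001
fp16_small = to_fp16(small)        # keeps 10 mantissa bits
bf16_small = to_bf16(small)        # keeps only 7 mantissa bits

print(fp16_big, fp16_diff, bf16_big)
print(fp16_small, bf16_small)
```

bf16 shares the 8-bit exponent of float32, so large values stay finite, while its shorter mantissa loses detail that fp16 retains.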


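Thanks. I only recently started working on audio separation. A while ago I tried several different backbones, and all of them produced NaN midway through training; on V100 cards the only fix was to switch back to 32-bit precision and continue training. I never ran into this with ASR or small NLP models, so I was curious.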