kazewong opened this issue a year ago · 0 comments
It has been observed on multiple occasions that the loss can suddenly jump to a high value, essentially destroying whatever was learned before.
Implementing gradient clipping should alleviate this problem.
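A minimal sketch of what this could look like, written here with NumPy arrays for illustration (the function name and threshold are placeholders, not anything from this codebase; in a JAX training loop the equivalent would typically be `optax.clip_by_global_norm` chained in front of the optimizer):

```python
import numpy as np

def clip_by_global_norm(grads, max_norm):
    """Rescale every gradient array by the same factor so that the
    joint L2 norm across all of them is at most max_norm."""
    global_norm = np.sqrt(sum(np.sum(np.square(g)) for g in grads))
    if global_norm > max_norm:
        grads = [g * (max_norm / global_norm) for g in grads]
    return grads

# A gradient with global norm 5 is rescaled down to norm 1,
# which caps the size of any single update step.
grads = [np.array([3.0, 0.0]), np.array([0.0, 4.0])]
clipped = clip_by_global_norm(grads, 1.0)
```

Clipping by the global norm (rather than per-element) preserves the direction of the gradient while bounding the step size, so a single pathological batch can no longer wipe out what the model has learned.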