Assertion Error On Finiteness
MichaelYu781 opened this issue · 2 comments
Hi, @stepankonev.
Thanks for sharing your code; it really helps a lot. However, when I try to train the model with it, the assertion assert torch.isfinite( ... ).all() always fails after about four or five epochs, which interrupts the training process. Is this normal? Do you have any suggestions on how to avoid the interruption and complete the training?
Hi, @MichaelYu781!
Thank you for your interest in the repository. There may indeed be some stability issues. I restarted training from the checkpoint a few times, and that helped get past the assertion, although this workaround is probably acceptable only if a deadline is coming up soon. The other option, which takes more time, is to balance the batch size, learning rate, and gradient clipping parameters. I hope to dig into this case, but I will only be able to do so later, when I have time. If you have any ideas on how to improve the robustness of the model, please share them here.
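To make the gradient clipping suggestion concrete, here is a minimal sketch of a training step that clips gradients and skips a batch when the loss or gradient norm is non-finite, rather than asserting. It is not the repository's training loop: the helper name safe_training_step, the max_grad_norm value, and the toy nn.Linear model are all assumptions made for illustration.

```python
import torch
from torch import nn


def safe_training_step(model, optimizer, loss_fn, inputs, targets, max_grad_norm=1.0):
    """Run one optimisation step; return the loss as a float, or None if skipped.

    Hypothetical helper for illustration, not part of the repository.
    """
    optimizer.zero_grad()
    loss = loss_fn(model(inputs), targets)
    if not torch.isfinite(loss):
        return None  # skip this batch instead of raising an assertion

    loss.backward()
    # clip_grad_norm_ rescales gradients in place and returns the total
    # gradient norm measured before clipping.
    grad_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)
    if not torch.isfinite(grad_norm):
        optimizer.zero_grad()  # discard the bad gradients
        return None

    optimizer.step()
    return loss.item()


# Toy setup (assumed values; tune lr and max_grad_norm for the real model).
torch.manual_seed(0)
model = nn.Linear(4, 1)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

good = safe_training_step(model, optimizer, loss_fn, torch.randn(8, 4), torch.randn(8, 1))
bad_inputs = torch.full((8, 4), float("inf"))  # forces a non-finite loss
bad = safe_training_step(model, optimizer, loss_fn, bad_inputs, torch.randn(8, 1))
```

Skipping a batch only hides the symptom, so it may be worth counting how often it happens; if skips become frequent, lowering the learning rate or max_grad_norm is probably the better fix.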
Thanks!
Thanks for your reply, sir!