locuslab/fast_adversarial

Facing "nan" values during model training

baogiadoan opened this issue · 1 comment

Hi, while training with my custom objective loss I noticed that the model sometimes goes wrong, produces "nan" values, and becomes invalid. I never ran into this with other training methods. Is it because the peak of the cyclic learning rate is too large and makes the loss diverge, as mentioned in the paper ("For each method, we individually tune λ to be as large as possible without causing the training loss to diverge")? Or is it a bug?

I ran the original code again with epochs=30 and hit the same issue:
[training log screenshot]
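
For reference, here is a minimal sketch of how I could test the learning-rate hypothesis: lower the peak of the cyclic schedule and fail fast as soon as the loss turns non-finite. The toy model, dummy loader, and `lr_max` below are placeholders, not this repo's actual code or flags.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

device = "cuda" if torch.cuda.is_available() else "cpu"

# Toy stand-ins so the sketch runs end to end; swap in the real model and data.
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10)).to(device)
criterion = nn.CrossEntropyLoss()
train_loader = DataLoader(
    TensorDataset(torch.randn(256, 3, 32, 32), torch.randint(0, 10, (256,))),
    batch_size=128, shuffle=True,
)

epochs = 30
lr_max = 0.2  # if the loss still blows up, halve this and retrain

opt = torch.optim.SGD(model.parameters(), lr=lr_max, momentum=0.9, weight_decay=5e-4)
total_steps = epochs * len(train_loader)
scheduler = torch.optim.lr_scheduler.CyclicLR(
    opt, base_lr=0.0, max_lr=lr_max,
    step_size_up=total_steps // 2,
    step_size_down=total_steps - total_steps // 2,
)

for epoch in range(epochs):
    for X, y in train_loader:
        X, y = X.to(device), y.to(device)
        loss = criterion(model(X), y)
        # Fail fast instead of training on with a poisoned model.
        if not torch.isfinite(loss):
            raise RuntimeError(f"loss became non-finite at epoch {epoch}; lower lr_max")
        opt.zero_grad()
        loss.backward()
        opt.step()
        scheduler.step()
```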

You could try training with a smaller learning rate or clipping the (unscaled) gradients. Do you notice the same behavior both with and without mixed-precision training?
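
Concretely, a minimal sketch of what I mean by clipping the unscaled gradients, using `torch.cuda.amp` (the toy model and loader are placeholders; adapt it to whatever mixed-precision setup you are actually using):

```python
import torch
import torch.nn as nn
from torch.cuda.amp import GradScaler, autocast
from torch.utils.data import DataLoader, TensorDataset

device = "cuda" if torch.cuda.is_available() else "cpu"

# Toy stand-ins; replace with the real model and data loader.
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10)).to(device)
criterion = nn.CrossEntropyLoss()
train_loader = DataLoader(
    TensorDataset(torch.randn(256, 3, 32, 32), torch.randint(0, 10, (256,))),
    batch_size=128, shuffle=True,
)

opt = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
scaler = GradScaler()

for X, y in train_loader:
    X, y = X.to(device), y.to(device)
    opt.zero_grad()
    with autocast(enabled=(device == "cuda")):
        loss = criterion(model(X), y)
    scaler.scale(loss).backward()
    # Unscale first so the clipping threshold applies to the true gradients,
    # not to the loss-scaled ones.
    scaler.unscale_(opt)
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    scaler.step(opt)   # the step is skipped if any gradient is inf/nan
    scaler.update()
```

The order matters: calling `clip_grad_norm_` before `scaler.unscale_(opt)` would clip the scaled gradients, so the effective threshold would change with the current loss scale.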