facebookresearch/mae

MAE finetune train loss nan

CodingMice opened this issue · 5 comments

info : Loss is nan, stopping training.

Hey @CodingMice ,

It might be due to amp.autocast(). Disabling it via amp.autocast(enabled=False) solved my problem.
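For context, a minimal sketch of what that quick fix might look like in a training step (the function and argument names below are placeholders, not MAE's actual code):

```python
import torch

# Hypothetical training step; all names are placeholders, not MAE's exact code.
def train_step(model, criterion, optimizer, samples, targets):
    # Run the forward pass in full precision by disabling autocast.
    with torch.cuda.amp.autocast(enabled=False):
        outputs = model(samples)
        loss = criterion(outputs, targets)

    optimizer.zero_grad()
    loss.backward()   # no GradScaler needed once AMP is off
    optimizer.step()
    return loss.item()
```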

It would be great if more context were provided here. There are multiple ways the loss can go to NaN, and AMP can indeed be one of them.

FYI - This PyTorch issue thread with a long history could be a hint...
pytorch/pytorch#40497

And here's the troubleshooting guide for this issue (also suggested in that thread):
https://pytorch.org/tutorials/recipes/recipes/amp_recipe.html#loss-is-inf-nan

Anyway, a quick fix would be the one @Jeff-LiangF commented above.
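As a rough debugging aid along the lines of that recipe, a sketch like the one below can help catch the first non-finite loss while keeping AMP enabled (all names here are placeholders passed in as arguments; this is not MAE's actual training loop):

```python
import torch

def debug_train_loop(model, criterion, optimizer, data_loader):
    scaler = torch.cuda.amp.GradScaler()

    for step, (samples, targets) in enumerate(data_loader):
        with torch.cuda.amp.autocast():
            outputs = model(samples)
            loss = criterion(outputs, targets)

        # Stop early and inspect the batch as soon as the loss is no longer finite.
        if not torch.isfinite(loss):
            print(f"Non-finite loss {loss.item()} at step {step}")
            break

        optimizer.zero_grad()
        scaler.scale(loss).backward()
        scaler.step(optimizer)   # skips the update if grads contain inf/NaN
        scaler.update()
```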

exx8 commented

I've faced the same issue.
Remarkably, using gradient clipping has solved the issue + improved the results.
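For anyone looking for a concrete starting point, here is a minimal sketch of gradient clipping inside an AMP training step; max_norm=1.0 is only an illustrative value and the placeholder names are not taken from the MAE code. (If I remember correctly, the MAE fine-tuning script also exposes a --clip_grad argument, but check main_finetune.py.)

```python
import torch

def train_step_with_clipping(model, criterion, optimizer, scaler,
                             samples, targets, max_norm=1.0):
    """One AMP training step with gradient clipping; max_norm=1.0 is just an example."""
    with torch.cuda.amp.autocast():
        loss = criterion(model(samples), targets)

    optimizer.zero_grad()
    scaler.scale(loss).backward()

    # Unscale first so the clip threshold applies to the true gradient norms.
    scaler.unscale_(optimizer)
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)

    scaler.step(optimizer)
    scaler.update()
    return loss.item()
```

Unscaling before clipping matters with AMP, otherwise the threshold is applied to scaled gradients. Commonly used max_norm values seem to be around 1.0 or 5.0; 0.1 (as asked below) is stricter than usual.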


How do you set the value for gradient clipping? 0.1?