NaN values found in tensor
PJJie opened this issue · 4 comments
When I replace multiple max-pooling layers with SoftPool, I find NaN values in the tensor.
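For anyone debugging this, a quick first step is to check intermediate tensors for NaNs directly so you can pinpoint where they first appear. A minimal sketch (the tensor here is made up for illustration, not actual SoftPool output):

```python
import torch

# Illustrative stand-in for a pooling layer's output.
out = torch.tensor([1.0, float("nan"), 3.0])

# Check whether any NaNs are present, and how many.
has_nan = torch.isnan(out).any().item()
nan_count = torch.isnan(out).sum().item()
```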
Returned NaN values are quite common when using CUDA, as it is a low-level language that does not include any internal checks for numerical overflow or underflow. PyTorch itself provides a range of functions (e.g. `torch.nan_to_num()`) to deal with such cases. Simply wrapping your output with one of these functions should alleviate the issue.
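As a minimal sketch of the wrapping suggested above (the input values are made up for illustration), `torch.nan_to_num()` replaces NaNs and infinities with finite substitutes:

```python
import torch

# Illustrative output containing NaN and +/- infinity.
x = torch.tensor([float("nan"), float("inf"), -float("inf"), 2.0])

# Replace non-finite entries with chosen finite values.
cleaned = torch.nan_to_num(x, nan=0.0, posinf=1e4, neginf=-1e4)
```

Note that this masks the symptom rather than fixing the underlying overflow, so it is best used as a stopgap while the root cause is investigated.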
I am also planning on including this in the coming repo commits.
Best,
Alex
Hi @alexandrosstergiou, I would like to know whether this bug has been fixed or if there has been any progress. I am also using SoftPool in a project; I do not encounter this problem myself, but other users of my project do: haomo-ai/MotionSeg3D#6
Hi @MaxChanger. Most NaN-value problems in fwd/bwd calls have been fixed since torch 1.6, where `torch.amp` was integrated alongside its decorators for custom functions. After commit f49fd84, I had stable runs in both full- and mixed-precision settings across different GPUs, environments, and configurations. Since then I have not noticed any NaN values occurring while training in other projects.
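The amp decorators for custom autograd functions mentioned above can be sketched roughly as follows; this uses a toy square op as a hypothetical stand-in, since SoftPool's actual forward/backward are CUDA kernels:

```python
import torch
from torch.cuda.amp import custom_fwd, custom_bwd  # available since torch 1.6

class SquareFn(torch.autograd.Function):
    # Toy op standing in for a custom CUDA-backed function.
    @staticmethod
    @custom_fwd(cast_inputs=torch.float32)  # run forward in fp32 under autocast
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return x * x

    @staticmethod
    @custom_bwd  # backward runs with autocast disabled, matching forward's dtype
    def backward(ctx, grad_out):
        (x,) = ctx.saved_tensors
        return 2 * x * grad_out
```

The `cast_inputs=torch.float32` argument is what keeps the custom kernel out of fp16 under autocast, which is one common way such NaN issues in mixed-precision training get resolved.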
Perhaps it would be worth suggesting that anyone opening such an issue in your project re-install the latest version of SoftPool and ensure that they are using torch >= 1.7 (preferably the latest release), just to be sure?
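A quick way for affected users to verify the version part of that suggestion (illustrative sketch only):

```python
import torch

# Version strings look like "1.7.1" or "2.1.0+cu118"; build suffixes only
# appear in the patch component, so the first two fields parse as integers.
major, minor = (int(part) for part in torch.__version__.split(".")[:2])
meets_minimum = (major, minor) >= (1, 7)
```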
Hi @alexandrosstergiou. Thank you for your kind reply. I have run nearly a hundred experiments on 4-5 different GPU servers and have not encountered this issue (NaN) either, so I considered your project robust enough.
Now that you have confirmed this, I am more at ease, and I will follow up with the other users to confirm the issue.