alexandrosstergiou/SoftPool

NaN values found in tensor

PJJie opened this issue · 4 comments

PJJie commented

When I replace multiple max-pooling layers with SoftPool layers, I get NaN values in the tensor.

Returned NaN values are quite common with custom CUDA code, as CUDA is low-level and does not include any built-in checks for numerical overflow or underflow. PyTorch itself has a range of functions (e.g. torch.nan_to_num()) to deal with such cases. Simply wrapping your output with one of these functions should alleviate the issue.
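As a rough illustration of the wrapping suggested above, the sketch below passes a layer's output through torch.nan_to_num(). The `NanSafe` wrapper name is hypothetical, and `nn.AvgPool2d` is only a stand-in for whichever pooling layer is producing NaNs; neither is part of this repository. It assumes a PyTorch version that ships torch.nan_to_num() (1.8 or later).

```python
import torch
import torch.nn as nn


class NanSafe(nn.Module):
    """Hypothetical wrapper: replaces NaN/inf in a layer's output with 0."""

    def __init__(self, layer: nn.Module):
        super().__init__()
        self.layer = layer

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.layer(x)
        # Map NaN, +inf and -inf to 0.0; finite values pass through unchanged.
        return torch.nan_to_num(out, nan=0.0, posinf=0.0, neginf=0.0)


# Stand-in pooling layer; swap in the pooling module you actually use.
pool = NanSafe(nn.AvgPool2d(kernel_size=2))

x = torch.full((1, 1, 2, 2), float("nan"))
y = pool(x)
print(y.shape, torch.isfinite(y).all().item())
```

Note that this only masks non-finite values after the fact; it does not remove whatever caused them, so it is probably best treated as a stopgap.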

I am also planning on including this in the coming repo commits.

Best,
Alex

Hi @alexandrosstergiou, I would like to know whether this bug has been fixed or whether there has been any progress. I'm also using SoftPool in a project and I don't see this problem myself, but other people have run into it with my project: haomo-ai/MotionSeg3D#6

Hi @MaxChanger. Most NaN-value problems in fwd/bwd calls have been fixed after torch 1.6, where torch.amp was integrated alongside its decorators for custom functions. After commit f49fd84, I had stable runs in both full- and mixed-precision settings across different GPUs, environments, and configurations. Since then I have not noticed any NaN values occurring whilst training in other projects.

Perhaps it will be worth suggesting to anyone opening an issue in your project to re-install the latest version of softpool and ensure that they are using torch >= 1.7 (preferably the latest one) to be sure?
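For anyone who wants to check this quickly, here is a minimal sketch of a version check plus a finite-value check that could be run before and during training. The helper names `torch_version_tuple` and `check_finite` are hypothetical and not part of softpool or PyTorch.

```python
import re

import torch


def torch_version_tuple(version: str):
    """Extract (major, minor) from a version string such as '1.7.1+cu110'."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"Unrecognised version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def check_finite(name: str, tensor: torch.Tensor) -> None:
    """Raise early if a tensor contains NaN or inf values."""
    if not torch.isfinite(tensor).all():
        raise FloatingPointError(f"Non-finite values found in {name}")


# The thread suggests torch >= 1.7; warn rather than fail on older installs.
if torch_version_tuple(torch.__version__) < (1, 7):
    print("torch is older than 1.7; consider upgrading before reporting NaNs")

# Example: check an activation after a pooling step.
check_finite("pooled output", torch.ones(2, 3))
```

Calling `check_finite` right after each pooling layer can help narrow down where the first NaN appears.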

Hi @alexandrosstergiou. Thank you for your kind reply. I have conducted nearly a hundred experiments on 4-5 different GPU servers, and I have not encountered this issue (NaN) either. Thus, I thought your project was robust enough.
After your confirmation, I am more at ease, and I will also cooperate with other people to confirm this issue.