model output is NAN for toy dataset
jonbakerfish opened this issue · 8 comments
Hi, I ran the infer.py script with the toy dataset. For some frames, the outputs of self.model (proj_output, last_feature) are NaN arrays. Why?
Hi, thank you for your interest in our work.
A similar issue is #6. However, I'm not entirely sure where the problem is; it may be related to SoftPool. Can you provide more information about your environment, such as the GPU, CUDA version, and conda env YAML file?
It seems that the NaN is caused by SoftPool; if I disable it, things run fine.
tools | version |
---|---|
GPU | RTX 3070 Laptop |
CUDA | 11.6 |
torch | 1.12.1+cu116 |
torchaudio | 0.12.1+cu116 |
torchsparse | 1.4.0 |
torchvision | 0.13.1+cu116 |
Hi @jonbakerfish, if you are sure that it is a softpool problem, you can refer to alexandrosstergiou/SoftPool#12 and alexandrosstergiou/adaPool#2.
I tried to reproduce this NaN issue on servers with 3090, 2080Ti, V100, and 2070 Super GPUs, but could not. 😅
If you have time, I hope you can debug it in depth to help others.
Hi @jonbakerfish, I finally found the cause of this problem!
I had overlooked the SoftPool version. Its author updated the repo on 2022/04/07, and our project was implemented before that, using an earlier version at commit 2d2ec6d.
When I use the newer version d056ab8 of the SoftPool code for inference, NaN also appears, just like what you encountered during inference or training.
Pinning down the specific reason may require carefully checking the SoftPool code and discussing it with its author, but rolling back the SoftPool version is a quick solution to this problem.
git clone https://github.com/alexandrosstergiou/SoftPool.git
cd SoftPool
git checkout 2d2ec6d # rollback to 2d2ec6dca10b7683ffd41061a27910d67816bfa5
cd pytorch
make install
# (optional) run the tests
make test
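To confirm that the checkout really sits on the rolled-back commit before building, a check along these lines may help. This is only a sketch: the helper names `current_commit` and `commit_matches` are made up for illustration, and it inspects the source tree, not the extension that is already installed in your environment.

```python
import subprocess


def current_commit(repo_dir: str) -> str:
    """Return the full SHA of HEAD for the git checkout at repo_dir."""
    result = subprocess.run(
        ["git", "-C", repo_dir, "rev-parse", "HEAD"],
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if git fails
    )
    return result.stdout.strip()


def commit_matches(full_sha: str, expected_prefix: str) -> bool:
    """True if full_sha starts with the (possibly abbreviated) expected id."""
    return full_sha.lower().startswith(expected_prefix.lower())


# Commit ids taken from this thread.
ROLLED_BACK = "2d2ec6d"
FULL_SHA = "2d2ec6dca10b7683ffd41061a27910d67816bfa5"

print(commit_matches(FULL_SHA, ROLLED_BACK))   # True
print(commit_matches("d056ab8", ROLLED_BACK))  # False
```

In practice you would call `commit_matches(current_commit("SoftPool"), ROLLED_BACK)`, where "SoftPool" is assumed to be the directory created by the clone above.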
I hope you can help verify it. If you have any questions, please contact me again.
Just for reference:
The model worked fine on the toy dataset, but produced NaN on data from my own sensor. The cause was SoftPool, even with the older version 2d2ec6d. Disabling it fixed the issue.
Hi @L-Reichardt, thank you for your feedback. The temporary workaround of rolling back to 2d2ec6d has been verified by several developers, and there should be no problem with the dataset used in this project.
Disabling SoftPool may cause a slight decrease in performance, as shown by the ablation experiments in the paper.
I hope you can confirm two things. First, whether the installation was actually replaced by the rolled-back version. Second, whether your own data is clean and does not contain NaN.
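For the second point, a quick way to rule out bad input is to scan each frame for non-finite values before it reaches the model. The following is a minimal sketch, assuming a frame is a sequence of per-point rows (for example x, y, z, intensity); the function name `find_non_finite` and the sample frame are hypothetical.

```python
import math
from typing import Iterable, List, Sequence, Tuple


def find_non_finite(points: Iterable[Sequence[float]]) -> List[Tuple[int, int]]:
    """Return (row, column) positions of NaN or +/-inf values."""
    bad = []
    for row_idx, point in enumerate(points):
        for col_idx, value in enumerate(point):
            if not math.isfinite(value):
                bad.append((row_idx, col_idx))
    return bad


# Hypothetical frame: x, y, z, intensity per point.
frame = [
    [1.0, 2.0, 0.5, 0.1],
    [float("nan"), 0.0, 0.0, 0.2],
    [3.0, float("inf"), 1.0, 0.3],
]
print(find_non_finite(frame))  # [(1, 0), (2, 1)]
```

An empty result means the frame itself is clean, which points back at the network or its dependencies. If the points are already in a NumPy array, `np.isfinite(arr).all()` gives the same yes/no answer in one call.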
@MaxChanger my bad, you are correct. Recently I updated to a newer SoftPool version in order to use PyTorch 1.13.1, but forgot about it.
I used your "const inf" suggestion and it works fine now.
@L-Reichardt Great to hear that 🙃