Loss goes to NaN

Question

prrw opened this issue 2 years ago · 0 comments

The loss goes to NaN from the 4th step of the 1st epoch. The same training script works fine on another computer with a different graphics card. The requirements list is identical on both machines, and both run Windows.

Works here:
NVIDIA Quadro M2200
Driver version: 471.11
CUDA Version: 11.4

Does not work here (loss goes to NaN):
NVIDIA RTX A4000
Driver version: 512.59
CUDA Version: 11.6 (CUDA 11.2 is also installed but 11.6 is active)

Also, before the loss goes to NaN, training takes a long time to start (around 30 minutes), which is not the case on the other PC.

Requirements:

blas                      1.0                         mkl
certifi                   2022.9.14        py37haa95532_0
cudatoolkit               10.2.89              h74a9793_1
freetype                  2.10.4               hd328e21_0
intel-openmp              2021.4.0          haa95532_3556
jpeg                      9e                   h2bbff1b_0
lerc                      3.0                  hd77b12b_0
libdeflate                1.8                  h2bbff1b_5
libpng                    1.6.37               h2a8f88b_0
libtiff                   4.4.0                h8a3f274_0
libwebp                   1.2.2                h2bbff1b_0
lz4-c                     1.9.3                h2bbff1b_1
mkl                       2021.4.0           haa95532_640
mkl-service               2.4.0            py37h2bbff1b_0
mkl_fft                   1.3.1            py37h277e83a_0
mkl_random                1.2.2            py37hf11a4ad_0
ninja                     1.10.2               haa95532_5
ninja-base                1.10.2               h6d14046_5
numpy                     1.21.5           py37h7a0a035_3
numpy-base                1.21.5           py37hca35cd5_3
opencv-python             4.6.0.66                 pypi_0    pypi
pillow                    9.2.0            py37hdc2b20a_1
pip                       22.1.2           py37haa95532_0
protobuf                  3.20.1                   pypi_0    pypi
pyconfigparser            1.0.5                    pypi_0    pypi
python                    3.7.0                hea74fb7_0
pytorch                   1.6.0           py3.7_cuda102_cudnn7_0    pytorch
pyyaml                    6.0                      pypi_0    pypi
setuptools                63.4.1           py37haa95532_0
six                       1.16.0             pyhd3eb1b0_1
tensorboardx              2.5.1                    pypi_0    pypi
tk                        8.6.12               h2bbff1b_0
torchvision               0.7.0                py37_cu102    pytorch
vc                        14.2                 h21ff451_1
vs2015_runtime            14.27.29016          h5e58377_2
wheel                     0.37.1             pyhd3eb1b0_0
wincertstore              0.2              py37haa95532_2
xz                        5.2.5                h8cc25b3_1
zlib                      1.2.12               h8cc25b3_3
zstd                      1.5.2                h19a0ad4_0

I suspect this is an NVIDIA CUDA version issue. What could be going wrong?
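
One thing that could be checked is whether the CUDA version PyTorch was built against covers the GPU's compute capability. The environment above uses a PyTorch build for CUDA 10.2 (`cudatoolkit 10.2.89`, `py3.7_cuda102_cudnn7_0`), while the RTX A4000 is an Ampere card with compute capability 8.6, which as far as I know only gained native support in CUDA 11.1. Below is a minimal sketch of such a check. The lookup table and the helper names (`MIN_CUDA_FOR_CAPABILITY`, `build_supports_gpu`, `report`) are made up for illustration, and the table lists only a few architectures, so treat it as a diagnostic aid rather than a confirmed explanation.

```python
# Hypothetical lookup: earliest CUDA toolkit with native support for a given
# compute capability (major, minor). Only a few entries are listed; extend
# from NVIDIA's documentation as needed.
MIN_CUDA_FOR_CAPABILITY = {
    (7, 5): (10, 0),  # Turing
    (8, 0): (11, 0),  # Ampere (e.g. A100)
    (8, 6): (11, 1),  # Ampere (e.g. RTX A4000)
}


def parse_version(text):
    """Turn a version string such as '10.2' or '10.2.89' into (10, 2)."""
    return tuple(int(part) for part in text.split(".")[:2])


def build_supports_gpu(build_cuda, capability):
    """Return True/False, or None when the capability is not in the table."""
    needed = MIN_CUDA_FOR_CAPABILITY.get(tuple(capability))
    if needed is None:
        return None
    return parse_version(build_cuda) >= needed


def report():
    """Print the check for the local machine if a CUDA build of PyTorch is present."""
    try:
        import torch
    except ImportError:
        print("PyTorch is not installed in this environment.")
        return
    if not torch.cuda.is_available():
        print("No CUDA device is visible to PyTorch.")
        return
    build_cuda = torch.version.cuda                      # CUDA the binary was built with
    capability = torch.cuda.get_device_capability(0)     # e.g. (8, 6)
    print("GPU:", torch.cuda.get_device_name(0))
    print("PyTorch CUDA build:", build_cuda)
    print("Compute capability:", capability)
    print("Build covers this GPU:", build_supports_gpu(build_cuda, capability))


if __name__ == "__main__":
    report()
```

For the setup in this issue, `build_supports_gpu("10.2", (8, 6))` returns `False`. If that is what the A4000 machine reports, a PyTorch build for CUDA 11.1 or newer would be worth trying. Note that the "CUDA Version" shown by the driver (11.6 here) is separate from the `cudatoolkit` version inside the conda environment.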