RuntimeError: CUDA error: unknown error
albertz opened this issue · 0 comments
albertz commented
This is likely a hardware issue; there is also the similar #1465. I just want to report this here for future reference.
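For reference, the error message in the log below suggests `CUDA_LAUNCH_BLOCKING=1`. A stand-alone check like the following (purely hypothetical, not something from this setup) could be run on the suspected GPU to see whether kernels already fail outside of RETURNN:

```python
# Hypothetical stand-alone GPU sanity check (not part of this setup):
# run it with CUDA_LAUNCH_BLOCKING=1 so a failing kernel is reported at the
# call site instead of at some later, unrelated CUDA API call, e.g.:
#   CUDA_LAUNCH_BLOCKING=1 python gpu_sanity_check.py
import torch


def gpu_sanity_check(device: str = "cuda:0", iterations: int = 100) -> None:
    x = torch.randn(4096, 4096, device=device)
    for _ in range(iterations):
        x = x @ x
        x = x / x.norm()  # keep values bounded across iterations
    torch.cuda.synchronize(device)  # force any pending async CUDA error to surface
    print("OK:", device, x.sum().item())


if __name__ == "__main__":
    gpu_sanity_check()
```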
Multi-GPU training (but that's likely not relevant). Log (/work/asr4/zeyer/setups-data/combined/2021-05-31/work/i6_core/returnn/training/ReturnnTrainingJob.Hh9Pv7JpsMlW/engine/i6_core.returnn.training.ReturnnTrainingJob.Hh9Pv7JpsMlW.run.7238888.1):
RETURNN starting up, version 1.20240522.175941+git.36de1fe4, date/time 2024-05-26-15-40-48 (UTC+0000), pid 116430, cwd /work/asr4/zeyer/setups-data/combined/2021-05-31/work/i6_core/returnn/training/ReturnnTrainingJob.Hh9Pv7JpsMlW/work, Python /work/tools/users/zeyer/py-envs/py3.11-torch2.1/bin/python3.11
...
PyTorch: 2.1.0+cu121 (7bcf7da3a268b435777fe87c7794c382f444e86d) (<site-package> in /work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch)
...
Torch: Hostname cn-229, pid 116429, using GPU 2.
Torch: Hostname cn-229, pid 116430, using GPU 3.
Torch: Hostname cn-229, pid 116427, using GPU 0.
Torch: Hostname cn-229, pid 116428, using GPU 1.
...
Start running torch distributed training on local rank 2.
Using gpu device 2: NVIDIA GeForce GTX 1080 Ti
Using device: cuda ('gpu' in config)
Using device: cuda ('gpu' in config)
Start running torch distributed training on local rank 1.
Start running torch distributed training on local rank 0.
Using gpu device 1: NVIDIA GeForce GTX 1080 Ti
Using gpu device 0: NVIDIA GeForce GTX 1080 Ti
Using device: cuda ('gpu' in config)
Start running torch distributed training on local rank 3.
Using gpu device 3: NVIDIA GeForce GTX 1080 Ti
Total GPU 1 memory 10.9GB, free 10.8GB
Learning-rate-control: loading file learning_rates
Total GPU 2 memory 10.9GB, free 10.8GB
Total GPU 0 memory 10.9GB, free 10.8GB
Learning-rate-control: loading file learning_rates
Total GPU 3 memory 10.9GB, free 10.8GB
...
ep 361 train, step 0, ctc_4 2.316, ctc_8 1.951, ctc 1.867, num_seqs 47, max_size:time 50184, max_size:out-spatial 18, mem_usage:cuda:2 5.9GB, 4.276 sec/step
ep 361 train, step 0, ctc_4 1.944, ctc_8 1.478, ctc 1.400, num_seqs 39, max_size:time 57040, max_size:out-spatial 17, mem_usage:cuda:0 5.6GB, 4.338 sec/step
ep 361 train, step 0, ctc_4 1.362, ctc_8 0.962, ctc 0.790, num_seqs 42, max_size:time 55881, max_size:out-spatial 16, mem_usage:cuda:1 5.9GB, 5.312 sec/step
ep 361 train, step 0, ctc_4 1.826, ctc_8 1.497, ctc 1.372, num_seqs 38, max_size:time 61953, max_size:out-spatial 15, mem_usage:cuda:3 6.0GB, 4.215 sec/step
ep 361 train, step 1, ctc_4 1.502, ctc_8 1.155, ctc 1.010, num_seqs 40, max_size:time 59688, max_size:out-spatial 20, mem_usage:cuda:2 6.0GB, 0.606 sec/step
ep 361 train, step 1, ctc_4 1.574, ctc_8 1.248, ctc 1.063, num_seqs 31, max_size:time 76640, max_size:out-spatial 21, mem_usage:cuda:0 6.0GB, 0.658 sec/step
ep 361 train, step 1, ctc_4 1.956, ctc_8 1.644, ctc 1.465, num_seqs 37, max_size:time 64417, max_size:out-spatial 23, mem_usage:cuda:1 6.0GB, 0.627 sec/step
ep 361 train, step 1, ctc_4 1.681, ctc_8 1.323, ctc 1.253, num_seqs 31, max_size:time 76560, max_size:out-spatial 19, mem_usage:cuda:3 6.0GB, 0.619 sec/step
...
ep 379 devtrain eval, step 259, ctc_4 0.483, ctc_8 0.252, ctc 0.185, mem_usage:cuda:0 2.2GB
ep 379 devtrain eval, step 260, ctc_4 0.599, ctc_8 0.433, ctc 0.342, mem_usage:cuda:0 2.2GB
ep 379 devtrain eval, step 261, ctc_4 0.298, ctc_8 0.195, ctc 0.149, mem_usage:cuda:0 2.2GB
dev: score ctc_4 0.585 ctc_8 0.407 ctc 0.370 error None devtrain: score ctc_4 0.384 ctc_8 0.215 ctc 0.173 error None
Memory usage (cuda:0): alloc cur 1.7GB alloc peak 2.2GB reserved cur 8.7GB reserved peak 8.7GB
start epoch 380 global train step 470345 with effective learning rate 0.00041155071186440683 ...
start epoch 380 global train step 470345 with effective learning rate 0.00041155071186440683 ...
start epoch 380 global train step 470345 with effective learning rate 0.00041155071186440683 ...
16 epochs stored so far and keeping all.
start epoch 380 global train step 470345 with effective learning rate 0.00041155071186440683 ...
Memory usage (cuda:2): alloc cur 2.0GB alloc peak 2.0GB reserved cur 8.5GB reserved peak 8.5GB
Memory usage (cuda:0): alloc cur 1.7GB alloc peak 1.7GB reserved cur 8.7GB reserved peak 8.7GB
Memory usage (cuda:1): alloc cur 1.4GB alloc peak 1.4GB reserved cur 8.4GB reserved peak 8.4GB
Memory usage (cuda:3): alloc cur 2.1GB alloc peak 2.1GB reserved cur 8.7GB reserved peak 8.7GB
RuntimeError: CUDA error: unknown error
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
...
terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: unknown error
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:44 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x71d606992617 in /work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x71d60694d98d in /work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x71d606ccd9f8 in /work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/lib/libc10_cuda.so)
frame #3: <unknown function> + 0x169b6 (0x71d606c969b6 in /work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/lib/libc10_cuda.so)
frame #4: <unknown function> + 0x1947d (0x71d606c9947d in /work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/lib/libc10_cuda.so)
frame #5: <unknown function> + 0x1989d (0x71d606c9989d in /work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/lib/libc10_cuda.so)
frame #6: <unknown function> + 0x513c46 (0x71d5c7730c46 in /work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/lib/libtorch_python.so)
frame #7: <unknown function> + 0x55ca7 (0x71d606977ca7 in /work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #8: c10::TensorImpl::~TensorImpl() + 0x1e3 (0x71d60696fcb3 in /work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #9: c10::TensorImpl::~TensorImpl() + 0x9 (0x71d60696fe49 in /work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #10: <unknown function> + 0x4bd16c7 (0x71d5b4eda6c7 in /work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #11: torch::autograd::deleteNode(torch::autograd::Node*) + 0xa9 (0x71d5b4ed2b59 in /work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #12: std::_Sp_counted_deleter<torch::autograd::generated::SumBackward0*, void (*)(torch::autograd::Node*), std::allocator<void>, (__gnu_cxx::_Lock_policy)2>::_M_dispose() + 0xe (0x71d5b45af1ee in /work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #13: <unknown function> + 0x4ba8990 (0x71d5b4eb1990 in /work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #14: c10::TensorImpl::~TensorImpl() + 0x1da (0x71d60696fcaa in /work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #15: c10::TensorImpl::~TensorImpl() + 0x9 (0x71d60696fe49 in /work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #16: <unknown function> + 0x7c84d8 (0x71d5c79e54d8 in /work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/lib/libtorch_python.so)
frame #17: THPVariable_subclass_dealloc(_object*) + 0x305 (0x71d5c79e5865 in /work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/lib/libtorch_python.so)
<omitting python frames>
frame #33: <unknown function> + 0x291b7 (0x71d632f891b7 in /work/tools/users/zeyer/linuxbrew/opt/glibc/lib/libc.so.6)
frame #34: __libc_start_main + 0x7c (0x71d632f8926c in /work/tools/users/zeyer/linuxbrew/opt/glibc/lib/libc.so.6)
frame #35: _start + 0x21 (0x401071 in /work/tools/users/zeyer/py-envs/py3.11-torch2.1/bin/python3.11)
Fatal Python error: Aborted
Current thread 0x000071d632f5f000 (most recent call first):
Garbage-collecting
<no Python frame>
Signal handler: signal 6:
Signal handler: signal 6:
/var/tmp/zeyer/returnn_native/native_signal_handler/c14b833885/native_signal_handler.so(signal_handler+0x4b)[0x71d606c7c20b]
/work/tools/users/zeyer/linuxbrew/opt/glibc/lib/libc.so.6(+0x3cf40)[0x71d632f9cf40]
/work/tools/users/zeyer/linuxbrew/opt/glibc/lib/libc.so.6(+0x86e6f)[0x71d632fe6e6f]
/work/tools/users/zeyer/linuxbrew/opt/glibc/lib/libc.so.6(raise+0x12)[0x71d632f9cea2]
/work/tools/users/zeyer/linuxbrew/opt/glibc/lib/libc.so.6(+0x3cf40)[0x71d632f9cf40]
/work/tools/users/zeyer/linuxbrew/opt/glibc/lib/libc.so.6(+0x86e6f)[0x71d632fe6e6f]
/work/tools/users/zeyer/linuxbrew/opt/glibc/lib/libc.so.6(raise+0x12)[0x71d632f9cea2]
/work/tools/users/zeyer/linuxbrew/opt/glibc/lib/libc.so.6(abort+0xc2)[0x71d632f8845c]
/work/tools/users/zeyer/linuxbrew/opt/gcc/lib/gcc/current/libstdc++.so.6(+0xa58d9)[0x71d60836b8d9]
/work/tools/users/zeyer/linuxbrew/opt/gcc/lib/gcc/current/libstdc++.so.6(+0xb0f0a)[0x71d608376f0a]
/work/tools/users/zeyer/linuxbrew/opt/gcc/lib/gcc/current/libstdc++.so.6(+0xaff79)[0x71d608375f79]
/work/tools/users/zeyer/linuxbrew/opt/gcc/lib/gcc/current/libstdc++.so.6(__gxx_personality_v0+0x86)[0x71d608376696]
/work/tools/users/zeyer/linuxbrew/opt/gcc/lib/gcc/current/libgcc_s.so.1(+0x17934)[0x71d6324a6934]
/work/tools/users/zeyer/linuxbrew/opt/gcc/lib/gcc/current/libgcc_s.so.1(_Unwind_Resume+0x12d)[0x71d6324a738d]
/work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/lib/libc10_cuda.so(+0x117f7)[0x71d606c917f7]
/work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/lib/libc10_cuda.so(+0x1989d)[0x71d606c9989d]
/work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/lib/libtorch_python.so(+0x513c46)[0x71d5c7730c46]
/work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/lib/libc10.so(+0x55ca7)[0x71d606977ca7]
/work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/lib/libc10.so(_ZN3c1010TensorImplD1Ev+0x1e3)[0x71d60696fcb3]
/work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/lib/libc10.so(_ZN3c1010TensorImplD0Ev+0x9)[0x71d60696fe49]
/work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so(+0x4bd16c7)[0x71d5b4eda6c7]
/work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so(_ZN5torch8autograd10deleteNodeEPNS0_4NodeE+0xa9)[0x71d5b4ed2b59]
/work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so(_ZNSt19_Sp_counted_deleterIPN5torch8autograd9generated12SumBackward0EPFvPNS1_4NodeEESaIvELN9__gnu_cxx12_Lock_policyE2EE10_M_disposeEv+0xe)[0x71d5b45af1ee]
/work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so(+0x4ba8990)[0x71d5b4eb1990]
/work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/lib/libc10.so(_ZN3c1010TensorImplD1Ev+0x1da)[0x71d60696fcaa]
/work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/lib/libc10.so(_ZN3c1010TensorImplD0Ev+0x9)[0x71d60696fe49]
/work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/lib/libtorch_python.so(+0x7c84d8)[0x71d5c79e54d8]
/work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/lib/libtorch_python.so(_Z28THPVariable_subclass_deallocP7_object+0x305)[0x71d5c79e5865]
/work/tools/users/zeyer/linuxbrew/lib/libpython3.11.so.1.0(+0x1edb1d)[0x71d633441b1d]
...
RuntimeError: [../third_party/gloo/gloo/transport/tcp/unbound_buffer.cc:81] Timed out waiting 1800000ms for recv operation to complete
...
At the end, the job hangs at exit, probably also due to the hardware problem.
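The 1800000 ms in the Gloo timeout above is 30 minutes, which matches the default timeout of `torch.distributed.init_process_group`. Just as a sketch (RETURNN sets up distributed training internally, so this is not the actual code path): in a stand-alone `torch.distributed` script, a shorter timeout makes a crashed rank surface as an error on the surviving ranks sooner, instead of each of them waiting half an hour in a collective:

```python
# Minimal sketch, not the RETURNN code path: shorten the collective timeout so a
# dead rank is reported faster than the default 30 minutes (= the 1800000ms above).
# Assumes the usual launcher env vars (RANK, WORLD_SIZE, MASTER_ADDR, MASTER_PORT)
# are set, e.g. by torchrun.
import datetime
import torch
import torch.distributed as dist

dist.init_process_group(
    backend="nccl" if torch.cuda.is_available() else "gloo",
    timeout=datetime.timedelta(minutes=5),  # default is 30 minutes
)
rank = dist.get_rank()
device = (
    torch.device(f"cuda:{rank % torch.cuda.device_count()}")
    if torch.cuda.is_available()
    else torch.device("cpu")
)
t = torch.ones(1, device=device)
dist.all_reduce(t)  # times out here if another rank already died
dist.destroy_process_group()
```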