sniklaus/softmax-splatting

CUDA memory access error on multiple GPUs

Closed this issue · 8 comments

Hello, I am using the average forward warp on multiple GPUs, but I encountered the following error:
File "xxx/softsplat.py", line 354, in FunctionSoftsplat
tenNormalize[tenNormalize == 0.0] = 1.0
RuntimeError: CUDA error: an illegal memory access was encountered

I am quite confused: this line should only change values within a single tensor, yet it causes an illegal memory access error.
Could you please help me with this? It happens after several epochs.

Try running your script with CUDA_LAUNCH_BLOCKING=1 python yourscript.py and let me know what happens.
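In case it helps, you can also set it from inside the script, as long as this runs before CUDA is initialized (a minimal sketch):

```python
# Set CUDA_LAUNCH_BLOCKING programmatically; this must happen before
# torch initializes CUDA for it to take effect.
import os
os.environ['CUDA_LAUNCH_BLOCKING'] = '1'

import torch  # imported after setting the variable
```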

Thank you for your reply. After adding it, the output is:

File "xxx/FlowNet.py", line 93, in forward
warped_img0 = FunctionSoftsplat(tenInput=img0, tenFlow=flow[:,:2], tenMetric=None, strType='average')
File "xxx/softsplat.py", line 350, in FunctionSoftsplat
tenOutput = _FunctionSoftsplat.apply(tenInput, tenFlow)
File "xxx/softsplat.py", line 258, in forward
cupy_launch('kernel_Softsplat_updateOutput', cupy_kernel('kernel_Softsplat_updateOutput', {
File "cupy/cuda/function.pyx", line 201, in cupy.cuda.function.Function.call
File "cupy/cuda/function.pyx", line 183, in cupy.cuda.function._launch
File "cupy_backends/cuda/api/driver.pyx", line 306, in cupy_backends.cuda.api.driver.launchKernel
File "cupy_backends/cuda/api/driver.pyx", line 125, in cupy_backends.cuda.api.driver.check_status
cupy_backends.cuda.api.driver.CUDADriverError: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered

I found that before the error, the network experienced a sharp change in the loss, like:

epoch:6 705/761 time:0.00+0.55 loss_l1:1.9941e-02
epoch:6 706/761 time:0.00+0.51 loss_l1:1.3551e-02
epoch:6 707/761 time:0.00+0.60 loss_l1:1.9836e-02
epoch:6 708/761 time:0.00+0.52 loss_l1:1.9157e-02
epoch:6 709/761 time:0.00+0.55 loss_l1:9.4212e-02
epoch:6 710/761 time:0.00+0.54 loss_l1:4.6343e-02
epoch:6 711/761 time:0.00+0.53 loss_l1:9.0796e-02
epoch:6 712/761 time:0.00+0.52 loss_l1:2.4395e-01
epoch:6 713/761 time:0.00+0.51 loss_l1:3.0267e-01
epoch:6 714/761 time:0.00+0.52 loss_l1:1.7686e-01
epoch:6 715/761 time:0.00+0.53 loss_l1:1.6426e-01

I guess there may be some boundary issue related to the gradient? I am testing on three 2080 Tis, and I am also trying gradient clipping (see the sketch below).
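For reference, the clipping I am trying looks roughly like this (model, optimizer, and loss are placeholders for my actual training code):

```python
import torch

def training_step(model, optimizer, loss):
    # Clip gradients before the optimizer step; model, optimizer and loss
    # are placeholders standing in for the actual training loop.
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
    optimizer.zero_grad()
```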

I just updated softsplat.py, can you try again with the new version?

Sorry for the late reply; the error remains the same after using the new version:

File "xxx/FlowNet.py", line 87, in forward
warped_img1 = FunctionSoftsplat(tenInput=img1, tenFlow=flow[:,2:], tenMetric=None, strType='average')
File "xxx/softsplat.py", line 359, in FunctionSoftsplat
tenOutput = _FunctionSoftsplat.apply(tenInput, tenFlow)
File "xxx/softsplat.py", line 267, in forward
cupy_launch('kernel_Softsplat_updateOutput', cupy_kernel('kernel_Softsplat_updateOutput', {
File "cupy/cuda/function.pyx", line 201, in cupy.cuda.function.Function.call
File "cupy/cuda/function.pyx", line 183, in cupy.cuda.function._launch
File "cupy_backends/cuda/api/driver.pyx", line 306, in cupy_backends.cuda.api.driver.launchKernel
File "cupy_backends/cuda/api/driver.pyx", line 125, in cupy_backends.cuda.api.driver.check_status
cupy_backends.cuda.api.driver.CUDADriverError: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
Traceback (most recent call last):
File "cupy_backends/cuda/api/driver.pyx", line 260, in cupy_backends.cuda.api.driver.moduleUnload
File "cupy_backends/cuda/api/driver.pyx", line 125, in cupy_backends.cuda.api.driver.check_status
cupy_backends.cuda.api.driver.CUDADriverError: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
Exception ignored in: 'cupy.cuda.function.Module.dealloc'
Traceback (most recent call last):
File "cupy_backends/cuda/api/driver.pyx", line 260, in cupy_backends.cuda.api.driver.moduleUnload
File "cupy_backends/cuda/api/driver.pyx", line 125, in cupy_backends.cuda.api.driver.check_status
cupy_backends.cuda.api.driver.CUDADriverError: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
Traceback (most recent call last):
File "cupy_backends/cuda/api/driver.pyx", line 260, in cupy_backends.cuda.api.driver.moduleUnload
File "cupy_backends/cuda/api/driver.pyx", line 125, in cupy_backends.cuda.api.driver.check_status
cupy_backends.cuda.api.driver.CUDADriverError: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
Exception ignored in: 'cupy.cuda.function.Module.dealloc'
Traceback (most recent call last):
File "cupy_backends/cuda/api/driver.pyx", line 260, in cupy_backends.cuda.api.driver.moduleUnload
File "cupy_backends/cuda/api/driver.pyx", line 125, in cupy_backends.cuda.api.driver.check_status
cupy_backends.cuda.api.driver.CUDADriverError: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered

Also, four core dump files (core.8, core.9, core.10, core.11) were generated in the folder, which was not expected, and they are very large. The error always appears after a sharp change in the loss or at the end of an epoch. I will try lowering the learning rate; gradient clipping also failed.
Thank you for your reply.

I just pushed some more changes, maybe those make it work. 🤷‍♂️

Thank you. Problem solved. I used the new version and decreased the learning rate from 3e-4 to 1e-4, and there is no error.

I ran into the very same issue today and the root cause seems to be that the float-to-int conversion on the C++ side (e.g. int intNorthwestX = (int) (floor(fltX))) overflows for very large negative fltX. If fltX <= -2^31, intNorthwestX will be cast to the minimum int32 value of -2^31, and the boundary condition intNorthwestX >= 0 can evaluate to true due to a subsequent signed-to-unsigned integer conversion. In my tests this results in an illegal memory access error on CentOS with CuPy + CUDA 10.2, but is error-free on Ubuntu with newer CuPy + CUDA, probably because intNorthwestX >= 0 is handled differently.
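As a rough illustration in Python (NumPy's float-to-int32 cast follows the same C conversion rules, and the exact result of an out-of-range cast is platform-dependent):

```python
import numpy as np

# An out-of-range float -> int32 cast is undefined behaviour in C; on common
# hardware it ends up at INT_MIN (-2**31), which mirrors what happens to
# intNorthwestX when fltX is a pathologically large negative flow value.
fltX = np.array([-1e12], dtype=np.float32)       # e.g. a diverged flow value
intNorthwestX = np.floor(fltX).astype(np.int32)
print(intNorthwestX)                             # typically [-2147483648]
```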

There are two ways to get around this:

  • either clamp tenFlow, e.g. tenFlow = tenFlow.clamp(-10000, 10000) (see the sketch after this list)
  • or specify the data type for the intNorthwestX >= 0 condition, e.g. intNorthwestX >= (int)0
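A minimal sketch of the first workaround, assuming the FunctionSoftsplat interface shown in the tracebacks above (img0 and flow are placeholders):

```python
import torch
from softsplat import FunctionSoftsplat  # interface as used in the tracebacks above

def safe_average_splat(img0, flow):
    # Clamp the flow so the kernel's float -> int cast cannot overflow for
    # extreme (e.g. diverging) flow values.
    flow = flow.clamp(-10000.0, 10000.0)
    return FunctionSoftsplat(tenInput=img0, tenFlow=flow, tenMetric=None, strType='average')
```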

Thanks for sharing your findings, clamping the flow is definitely a good idea! 👍