Illegal memory access CUDA
Closed this issue · 5 comments
Hello, I have one weird issue I cannot solve. I have my dataloader and dataset. When I train on, let's say, 2 images everything is fine, but when I train on >5 images I always get this error:
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
If I run export CUDA_LAUNCH_BLOCKING=1 in the terminal, I then get the error:
RuntimeError: CUDA error: an illegal memory access was encountered
and it reports that there is some error at the radii>1 check in the renderer file.
If I remove this radii>1 check and just put in some nonsense value like 1, it reports the same error in some other place. That's why I cannot figure out what the error is.
Is it possible that the reason is some CUDA synchronization issue, or out of memory (which would be weird if it is)?
Thanks
Hi, when illegal CUDA memory is accessed, the debugger output is often unreliable. Do you have the same problem with the provided dataset? I may be able to find the reason if I can reproduce your problem.
It seems the issue was that some tensor was of type float64. When I changed it to float32, no error appeared. Do you maybe know why?
Oh yes, the data type compiled into the CUDA extension is float32; it will not change with your input type. An inconsistent data type would cause the illegal memory access. If it is necessary to use float64, you may change the data type in the CUDA code, but that might be quite tricky. So float32 is recommended.
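A minimal sketch of a defensive fix on the Python side, assuming the tensors are prepared before being handed to the CUDA rasterizer (the helper name `to_float32` is made up for illustration; it is not part of the project's API):

```python
import torch

def to_float32(*tensors):
    """Cast every tensor to float32 before calling the CUDA extension.

    The extension was compiled for float32 only, so a stray float64
    tensor (e.g. created via numpy defaults) triggers the illegal
    memory access. Casting is a no-op for tensors already in float32.
    """
    return tuple(t.to(torch.float32) for t in tensors)

# Example: 'means' accidentally ends up as float64 (numpy's default)
means = torch.rand(10, 3, dtype=torch.float64)
scales = torch.rand(10, 3, dtype=torch.float32)
means, scales = to_float32(means, scales)
print(means.dtype, scales.dtype)  # torch.float32 torch.float32
```

A common source of the stray float64 is converting numpy arrays (which default to float64) with `torch.from_numpy` without an explicit cast.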
Another possible reason for this CUDA illegal memory access error is NaN values in the rotation quaternions of the Gaussians. When one initializes the rotations with the normal2rotation function, there is a chance of NaN values appearing. One can then use torch.nan_to_num to handle such situations.
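A small sketch of that idea, assuming (w, x, y, z) quaternions of shape (N, 4); the function name `sanitize_quaternions` and the identity fallback are illustrative choices, not part of the repo:

```python
import torch

def sanitize_quaternions(q, eps=1e-8):
    """Replace NaN/Inf entries and re-normalize each quaternion.

    Rows whose norm collapses to zero after cleaning (e.g. an
    all-NaN quaternion) fall back to the identity rotation, so the
    CUDA rasterizer never sees an invalid rotation.
    """
    q = torch.nan_to_num(q, nan=0.0, posinf=0.0, neginf=0.0)
    norm = q.norm(dim=-1, keepdim=True)
    identity = torch.zeros_like(q)
    identity[..., 0] = 1.0  # (w, x, y, z) identity quaternion
    return torch.where(norm > eps, q / norm.clamp(min=eps), identity)

q = torch.tensor([[float("nan"), 0.0, 0.0, 0.0],   # broken row
                  [1.0, 1.0, 0.0, 0.0]])            # un-normalized row
print(sanitize_quaternions(q))
```

Calling this right after the rotation initialization keeps any NaN from a degenerate input normal out of the CUDA kernels.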
Thanks @YuePanEdward, it solved another issue, besides the dtype one, that I had.