Illegal memory access CUDA
Closed this issue · 5 comments
Hello, I have one weird issue I cannot solve. I have my dataloader and dataset. When I train on, let's say, 2 images everything is fine, but when I train on >5 images I always get this error:
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
If I run export CUDA_LAUNCH_BLOCKING=1 in the terminal, I then get the error:
RuntimeError: CUDA error: an illegal memory access was encountered
and it reports that there is some error at the radii>1 check in the renderer file.
If I remove this radii>1 check and just put in some nonsense value like 1, it reports the same error in some other place. That's why I cannot figure out what the error is.
Is it possible that the reason is some CUDA synchronization issue, or out of memory (which would be weird if it is)?
Thanks
Hi, when illegal CUDA memory is accessed, the debugger output is often unreliable. Do you have the same problem with the provided dataset? I may be able to find the reason if I can reproduce your problem.
It seems the issue was that some tensor was of type float64. When I changed it to float32, no error appeared. Do you maybe know why?
Oh yes, the data type compiled into the CUDA extension is float32; it will not change with your input type. An inconsistent data type would cause the illegal memory access. If it is necessary to use float64, you may change the data type in the CUDA code, but that might be quite tricky. So float32 is recommended.
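A minimal sketch of a defensive fix on the Python side, assuming the tensors are prepared before being handed to the CUDA rasterizer (the helper name `to_float32` is made up for illustration; it is not part of the project's API):

```python
import torch

def to_float32(*tensors):
    """Cast every tensor to float32 before calling the CUDA extension.

    The extension was compiled for float32 only, so a stray float64
    tensor (e.g. created via numpy defaults) triggers the illegal
    memory access. Casting is a no-op for tensors already in float32.
    """
    return tuple(t.to(torch.float32) for t in tensors)

# Example: 'means' accidentally ends up as float64 (numpy's default)
means = torch.rand(10, 3, dtype=torch.float64)
scales = torch.rand(10, 3, dtype=torch.float32)
means, scales = to_float32(means, scales)
print(means.dtype, scales.dtype)  # torch.float32 torch.float32
```

A common source of the stray float64 is converting numpy arrays (which default to float64) with `torch.from_numpy` without an explicit cast.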
Another possible reason for this CUDA illegal memory access error is NaN values in the rotation quaternions of the Gaussians. When one initializes the rotations with the normal2rotation function, there is a chance of NaN values appearing. One can then use torch.nan_to_num to handle such situations.
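A small sketch of that idea, assuming (w, x, y, z) quaternions of shape (N, 4); the function name `sanitize_quaternions` and the identity fallback are illustrative choices, not part of the repo:

```python
import torch

def sanitize_quaternions(q, eps=1e-8):
    """Replace NaN/Inf entries and re-normalize each quaternion.

    Rows whose norm collapses to zero after cleaning (e.g. an
    all-NaN quaternion) fall back to the identity rotation, so the
    CUDA rasterizer never sees an invalid rotation.
    """
    q = torch.nan_to_num(q, nan=0.0, posinf=0.0, neginf=0.0)
    norm = q.norm(dim=-1, keepdim=True)
    identity = torch.zeros_like(q)
    identity[..., 0] = 1.0  # (w, x, y, z) identity quaternion
    return torch.where(norm > eps, q / norm.clamp(min=eps), identity)

q = torch.tensor([[float("nan"), 0.0, 0.0, 0.0],   # broken row
                  [1.0, 1.0, 0.0, 0.0]])            # un-normalized row
print(sanitize_quaternions(q))
```

Calling this right after the rotation initialization keeps any NaN from a degenerate input normal out of the CUDA kernels.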
Thanks @YuePanEdward, it solved another issue, besides the dtype one, that I had.