vislearn/FrEIA

RuntimeError: NVML_SUCCESS == r INTERNAL ASSERT FAILED at "/opt/pytorch/pytorch/c10/cuda/CUDACachingAllocator.cpp":830

NikhilMank opened this issue · 4 comments

  • I am running a training script. It runs without any problem for 18,853 iterations and then throws this error. I had not changed anything at that point; the same code had already executed for many iterations before the failure.
  • The graph network was operated in 'rev' mode starting from iteration 12,000.
  • I have run the training multiple times, and the same error occurred at the same iteration and the same line of code.
  • I got the same error at different points with different batch sizes. For batch_size = 8, the error occurred at around iteration 17,500. For batch_size = 16, the error occurred at iteration 18,853 in three separate runs.

[screenshot: error traceback]

The traceback points directly to a line in your library's coupling block.

That’s a tough one. I have had success debugging such cases by training on the CPU and then obtaining a stack trace to the actual problem (CUDA can be nasty to debug, given that it’s asynchronous). It might also work with CUDA_LAUNCH_BLOCKING=1.
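For reference, because CUDA kernel launches are asynchronous, the Python stack trace of a CUDA error usually points at an unrelated line. A minimal sketch of forcing synchronous launches from inside the script (the variable can equally be set in the shell; it must be set before torch is imported):

```python
import os

# Force synchronous CUDA kernel launches so a failing kernel raises at the
# Python line that launched it, instead of at some later, unrelated call.
# Must be set before the first `import torch`.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

# import torch  # import torch only after the variable is set
```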

@fdraxler Thanks for the reply. I will try running the code on the GPU with CUDA_LAUNCH_BLOCKING=1 set.

@fdraxler I tried training with CUDA_LAUNCH_BLOCKING=1, but the error message was the same; there was no extra information. I am training on the CPU now.
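Since the symptom here is a numerical degeneration that only surfaces as a CUDA failure thousands of iterations later, one way to localize it is to abort as soon as the loss goes bad. A hypothetical guard (`assert_loss_healthy` is not part of the original training code):

```python
import math

def assert_loss_healthy(loss_value, iteration):
    # Hypothetical training-loop guard: stop as soon as the loss becomes
    # NaN, infinite, or exactly zero, instead of letting overflowing
    # activations propagate until an allocator or kernel fails much later.
    if not math.isfinite(loss_value) or loss_value == 0.0:
        raise RuntimeError(
            f"loss degenerated at iteration {iteration}: {loss_value}"
        )

assert_loss_healthy(1.37, iteration=100)  # a healthy loss passes silently
```

Calling this on `loss.item()` each iteration would pinpoint the first bad step rather than the eventual crash site.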

@fdraxler Hi, I ran it on the CPU and observed that the loss values become 0 after the failure point, and there is a massive increase in memory usage; I suspect the values are blowing up. I used a conv network as the subnet constructor for GLOWCouplingBlock, and trained with NLLLoss. I am attaching the network I used as an image. Do you think using a CNN as the subnet constructor in GLOWCouplingBlock could be causing the problem?

[images: conv subnet used as the GLOWCouplingBlock constructor]
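For context on the blow-up: in an affine coupling block the subnet predicts a log-scale s that is exponentiated, so an unbounded s overflows quickly. FrEIA's GLOW coupling blocks expose a `clamp` parameter that soft-limits s; a minimal sketch of that idea (the 2c/π·atan form is an assumption modeled on common FrEIA versions, not a quote of its source):

```python
import math

def soft_clamp(s, c=2.0):
    # Smoothly bound the predicted log-scale s to (-c, c), so that
    # exp(s) stays within (e^-c, e^c) and cannot overflow, while
    # remaining approximately identity for small |s|.
    return (2.0 * c / math.pi) * math.atan(s / c)

# An unclamped log-scale of 1000 would overflow exp(); clamped it is finite:
clamped = soft_clamp(1000.0)
print(clamped, math.exp(clamped))  # |clamped| < 2, so exp(clamped) < e^2
```

If the conv subnet's outputs grow without such a bound (or the subnet lacks a small-initialized final layer), the scales can explode exactly as observed here.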