vislearn/FrEIA

RuntimeError: NVML_SUCCESS == r INTERNAL ASSERT FAILED at "/opt/pytorch/pytorch/c10/cuda/CUDACachingAllocator.cpp":830

NikhilMank opened this issue · 4 comments

  • I am running a training script. It runs without any problem for 18,853 iterations and then throws this error. I had not changed anything at that point; the same code had already executed for many iterations before the failure.
  • The graph network was operated in 'rev' mode starting from iteration 12,000.
  • I have run the training multiple times, and the same error occurred at the same iteration and the same line of code.
  • I got the same error at different points with different batch sizes. For batch_size = 8, the error occurred at around iteration 17,500. For batch_size = 16, the error occurred at iteration 18,853 in three separate runs.

[screenshot: error traceback]

The traceback points directly to a line in your library's coupling block.

That’s a tough one. I have had success debugging such cases by training on the CPU and then obtaining a stack trace to the actual problem (CUDA can be nasty to debug, given that it’s asynchronous). It might also work with CUDA_LAUNCH_BLOCKING=1.
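For reference, because CUDA kernel launches are asynchronous, the Python stack trace of a CUDA error usually points at an unrelated line. A minimal sketch of forcing synchronous launches from inside the script (the variable can equally be set in the shell; it must be set before torch is imported):

```python
import os

# Force synchronous CUDA kernel launches so a failing kernel raises at the
# Python line that launched it, instead of at some later, unrelated call.
# Must be set before the first `import torch`.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

# import torch  # import torch only after the variable is set
```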

@fdraxler Thanks for the reply. I will try running the code on the GPU with CUDA_LAUNCH_BLOCKING=1 set.

@fdraxler I tried training with CUDA_LAUNCH_BLOCKING=1, but the error message was the same; there was no extra information. I am training on the CPU now.
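Since the symptom here is a numerical degeneration that only surfaces as a CUDA failure thousands of iterations later, one way to localize it is to abort as soon as the loss goes bad. A hypothetical guard (`assert_loss_healthy` is not part of the original training code):

```python
import math

def assert_loss_healthy(loss_value, iteration):
    # Hypothetical training-loop guard: stop as soon as the loss becomes
    # NaN, infinite, or exactly zero, instead of letting overflowing
    # activations propagate until an allocator or kernel fails much later.
    if not math.isfinite(loss_value) or loss_value == 0.0:
        raise RuntimeError(
            f"loss degenerated at iteration {iteration}: {loss_value}"
        )

assert_loss_healthy(1.37, iteration=100)  # a healthy loss passes silently
```

Calling this on `loss.item()` each iteration would pinpoint the first bad step rather than the eventual crash site.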

@fdraxler Hi, I ran it on the CPU and observed that the loss values become 0 after the failure point, and there is a massive increase in memory usage; I suspect the values are blowing up. I used a conv network as the subnet constructor for GLOWCouplingBlock, and trained with NLLLoss. I am attaching the network I used as an image. Do you think using a CNN as the subnet constructor in GLOWCouplingBlock could be causing the problem?

[images: conv subnet used as the GLOWCouplingBlock constructor]
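For context on the blow-up: in an affine coupling block the subnet predicts a log-scale s that is exponentiated, so an unbounded s overflows quickly. FrEIA's GLOW coupling blocks expose a `clamp` parameter that soft-limits s; a minimal sketch of that idea (the 2c/π·atan form is an assumption modeled on common FrEIA versions, not a quote of its source):

```python
import math

def soft_clamp(s, c=2.0):
    # Smoothly bound the predicted log-scale s to (-c, c), so that
    # exp(s) stays within (e^-c, e^c) and cannot overflow, while
    # remaining approximately identity for small |s|.
    return (2.0 * c / math.pi) * math.atan(s / c)

# An unclamped log-scale of 1000 would overflow exp(); clamped it is finite:
clamped = soft_clamp(1000.0)
print(clamped, math.exp(clamped))  # |clamped| < 2, so exp(clamped) < e^2
```

If the conv subnet's outputs grow without such a bound (or the subnet lacks a small-initialized final layer), the scales can explode exactly as observed here.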