lshqqytiger/ZLUDA

RuntimeError: cuDNN error: CUDNN_STATUS_INTERNAL_ERROR in PyTorch Training


Environment:

Windows 10 64-bit 22H2
ZLUDA v3.8
AMD Radeon Pro VII (Vega20) (gfx906)
ROCm and ZLUDA system environment variables set properly
cublas, cusparse, and nvrtc libraries substituted properly

Description

I'm taking AI classes, and when I try to use my local runtime with an AMD GPU, I get an error. I tested the same code on Colab and it worked fine, so it's probably not a problem with the code itself. The ipynb file is attached in case you need it for debugging; it's a small in-class model of about 250 lines.
Waldo.zip
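
The traceback below shows the failure originating in a plain `Conv2d` forward, so it should be reproducible without the full notebook. A minimal sketch (the layer and input shapes here are hypothetical, not taken from the notebook):

import torch
import torch.nn as nn

device = torch.device("cuda")  # ZLUDA exposes the AMD GPU as a CUDA device

# Hypothetical conv layer and input; any Conv2d forward goes through the
# same F.conv2d -> cuDNN path seen in the traceback
conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1).to(device)
img = torch.randn(1, 3, 224, 224, device=device)

pred = conv(img)  # raises RuntimeError: cuDNN error: CUDNN_STATUS_INTERNAL_ERROR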

Error Log

------------------------------
  Training process started
------------------------------
  0%|          | 0/246 [00:43<?, ?it/s]
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
Cell In[15], line 14
     12 label = data[1].to(device)
     13 label = label.to(torch.float32)
---> 14 pred = mynet(img)
     15 pred = torch.squeeze(pred)
     16 pred.reshape(-1)

File D:\Artificial Intelligence\Runtime\venv\lib\site-packages\torch\nn\modules\module.py:1532, in Module._wrapped_call_impl(self, *args, **kwargs)
   1530     return self._compiled_call_impl(*args, **kwargs)  # type: ignore[misc]
   1531 else:
-> 1532     return self._call_impl(*args, **kwargs)

File D:\Artificial Intelligence\Runtime\venv\lib\site-packages\torch\nn\modules\module.py:1541, in Module._call_impl(self, *args, **kwargs)
   1536 # If we don't have any hooks, we want to skip the rest of the logic in
   1537 # this function, and just call forward.
   1538 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   1539         or _global_backward_pre_hooks or _global_backward_hooks
   1540         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1541     return forward_call(*args, **kwargs)
   1543 try:
   1544     result = None

File D:\Artificial Intelligence\Runtime\venv\lib\site-packages\torch\nn\modules\container.py:217, in Sequential.forward(self, input)
    215 def forward(self, input):
    216     for module in self:
--> 217         input = module(input)
    218     return input

File D:\Artificial Intelligence\Runtime\venv\lib\site-packages\torch\nn\modules\module.py:1532, in Module._wrapped_call_impl(self, *args, **kwargs)
   1530     return self._compiled_call_impl(*args, **kwargs)  # type: ignore[misc]
   1531 else:
-> 1532     return self._call_impl(*args, **kwargs)

File D:\Artificial Intelligence\Runtime\venv\lib\site-packages\torch\nn\modules\module.py:1541, in Module._call_impl(self, *args, **kwargs)
   1536 # If we don't have any hooks, we want to skip the rest of the logic in
   1537 # this function, and just call forward.
   1538 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   1539         or _global_backward_pre_hooks or _global_backward_hooks
   1540         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1541     return forward_call(*args, **kwargs)
   1543 try:
   1544     result = None

File D:\Artificial Intelligence\Runtime\venv\lib\site-packages\torch\nn\modules\conv.py:460, in Conv2d.forward(self, input)
    459 def forward(self, input: Tensor) -> Tensor:
--> 460     return self._conv_forward(input, self.weight, self.bias)

File D:\Artificial Intelligence\Runtime\venv\lib\site-packages\torch\nn\modules\conv.py:456, in Conv2d._conv_forward(self, input, weight, bias)
    452 if self.padding_mode != 'zeros':
    453     return F.conv2d(F.pad(input, self._reversed_padding_repeated_twice, mode=self.padding_mode),
    454                     weight, bias, self.stride,
    455                     _pair(0), self.dilation, self.groups)
--> 456 return F.conv2d(input, weight, bias, self.stride,
    457                 self.padding, self.dilation, self.groups)

RuntimeError: cuDNN error: CUDNN_STATUS_INTERNAL_ERROR

Disable cuDNN:

torch.backends.cudnn.enabled = False
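
The flag needs to be set before the first forward pass, e.g. near the top of the notebook; PyTorch then falls back to its native CUDA convolution kernels, which ZLUDA should be able to translate. A minimal sketch (the model and shapes are placeholders, not the notebook's actual code):

import torch
import torch.nn as nn

# Disable cuDNN before any convolution runs
torch.backends.cudnn.enabled = False

device = torch.device("cuda")

# Placeholder network standing in for the ~250-line in-class model
mynet = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU()).to(device)

img = torch.randn(1, 3, 224, 224, device=device)
pred = mynet(img)  # runs without the cuDNN internal error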

Thanks, that fixed it.