[Issue]: hipErrorNoDevice error when printing `torch.cuda.is_available()`
Closed this issue · 3 comments
Problem Description
Hi.
I am using ROCm 6.2 on WSL with Ubuntu 24.04. After installing the ROCm build of PyTorch following the instructions, I printed torch.cuda.is_available()
, but it returned False
.
When I ran with AMD_LOG_LEVEL=7
to print additional debug information, I got the following output:
$ AMD_LOG_LEVEL=7 python3 -c 'import torch; print(torch.cuda.is_available())'
:3:rocdevice.cpp :468 : 0518006503 us: [pid:858 tid:0x7f19cb128080] Initializing HSA stack.
:4:runtime.cpp :85 : 0518006987 us: [pid:858 tid:0x7f19cb128080] init
:3:hip_context.cpp :49 : 0518006999 us: [pid:858 tid:0x7f19cb128080] Direct Dispatch: 1
:3:hip_device_runtime.cpp :651 : 0518007009 us: [pid:858 tid:0x7f19cb128080] hipGetDeviceCount ( 0x7fff0ec5b3c8 )
:3:hip_device_runtime.cpp :653 : 0518007026 us: [pid:858 tid:0x7f19cb128080] hipGetDeviceCount: Returned hipErrorNoDevice :
:3:hip_error.cpp :36 : 0518007039 us: [pid:858 tid:0x7f19cb128080] hipGetLastError ( )
:3:hip_error.cpp :36 : 0518007049 us: [pid:858 tid:0x7f19cb128080] hipGetLastError: Returned hipErrorNoDevice :
False
:3:hip_device_runtime.cpp :620 : 0518095352 us: [pid:858 tid:0x7f19cb128080] hipDeviceSynchronize ( )
:3:hip_device_runtime.cpp :620 : 0518095390 us: [pid:858 tid:0x7f19cb128080] hipDeviceSynchronize: Returned hipErrorNoDevice :
:1:hip_platform.cpp :182 : 0518095395 us: [pid:858 tid:0x7f19cb128080] Error during hipDeviceSynchronize, error: 100
:3:hip_device_runtime.cpp :620 : 0518097321 us: [pid:858 tid:0x7f19cb128080] hipDeviceSynchronize ( )
:3:hip_device_runtime.cpp :620 : 0518097343 us: [pid:858 tid:0x7f19cb128080] hipDeviceSynchronize: Returned hipErrorNoDevice :
:1:hip_platform.cpp :182 : 0518097346 us: [pid:858 tid:0x7f19cb128080] Error during hipDeviceSynchronize, error: 100
:3:hip_device_runtime.cpp :620 : 0518099229 us: [pid:858 tid:0x7f19cb128080] hipDeviceSynchronize ( )
:3:hip_device_runtime.cpp :620 : 0518099253 us: [pid:858 tid:0x7f19cb128080] hipDeviceSynchronize: Returned hipErrorNoDevice :
:1:hip_platform.cpp :182 : 0518099256 us: [pid:858 tid:0x7f19cb128080] Error during hipDeviceSynchronize, error: 100
:3:hip_device_runtime.cpp :620 : 0518100459 us: [pid:858 tid:0x7f19cb128080] hipDeviceSynchronize ( )
:3:hip_device_runtime.cpp :620 : 0518100481 us: [pid:858 tid:0x7f19cb128080] hipDeviceSynchronize: Returned hipErrorNoDevice :
:1:hip_platform.cpp :182 : 0518100484 us: [pid:858 tid:0x7f19cb128080] Error during hipDeviceSynchronize, error: 100
:3:hip_device_runtime.cpp :620 : 0518101450 us: [pid:858 tid:0x7f19cb128080] hipDeviceSynchronize ( )
:3:hip_device_runtime.cpp :620 : 0518101471 us: [pid:858 tid:0x7f19cb128080] hipDeviceSynchronize: Returned hipErrorNoDevice :
:1:hip_platform.cpp :182 : 0518101474 us: [pid:858 tid:0x7f19cb128080] Error during hipDeviceSynchronize, error: 100
:3:hip_device_runtime.cpp :620 : 0518102353 us: [pid:858 tid:0x7f19cb128080] hipDeviceSynchronize ( )
:3:hip_device_runtime.cpp :620 : 0518102374 us: [pid:858 tid:0x7f19cb128080] hipDeviceSynchronize: Returned hipErrorNoDevice :
:1:hip_platform.cpp :182 : 0518102377 us: [pid:858 tid:0x7f19cb128080] Error during hipDeviceSynchronize, error: 100
:3:hip_device_runtime.cpp :620 : 0518103070 us: [pid:858 tid:0x7f19cb128080] hipDeviceSynchronize ( )
:3:hip_device_runtime.cpp :620 : 0518103091 us: [pid:858 tid:0x7f19cb128080] hipDeviceSynchronize: Returned hipErrorNoDevice :
:1:hip_platform.cpp :182 : 0518103094 us: [pid:858 tid:0x7f19cb128080] Error during hipDeviceSynchronize, error: 100
:3:hip_device_runtime.cpp :620 : 0518103812 us: [pid:858 tid:0x7f19cb128080] hipDeviceSynchronize ( )
:3:hip_device_runtime.cpp :620 : 0518103832 us: [pid:858 tid:0x7f19cb128080] hipDeviceSynchronize: Returned hipErrorNoDevice :
:1:hip_platform.cpp :182 : 0518103835 us: [pid:858 tid:0x7f19cb128080] Error during hipDeviceSynchronize, error: 100
:3:hip_device_runtime.cpp :620 : 0518104532 us: [pid:858 tid:0x7f19cb128080] hipDeviceSynchronize ( )
:3:hip_device_runtime.cpp :620 : 0518104551 us: [pid:858 tid:0x7f19cb128080] hipDeviceSynchronize: Returned hipErrorNoDevice :
:1:hip_platform.cpp :182 : 0518104554 us: [pid:858 tid:0x7f19cb128080] Error during hipDeviceSynchronize, error: 100
... (The rest is repeating "Error during hipDeviceSynchronize, error: 100" etc.)
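For long traces like this, a small helper (purely illustrative, not part of ROCm or PyTorch) can tally which HIP errors appear, so a repetitive log reduces to a short summary:

```python
import re
from collections import Counter

def summarize_hip_errors(log_text):
    """Count occurrences of 'Returned <error>' lines in an AMD_LOG_LEVEL trace."""
    return Counter(re.findall(r"Returned (hipError\w+)", log_text))

# Two lines lifted from the trace above
sample = """
:3:hip_device_runtime.cpp :653 : ... hipGetDeviceCount: Returned hipErrorNoDevice :
:3:hip_error.cpp :36 : ... hipGetLastError: Returned hipErrorNoDevice :
"""
print(summarize_hip_errors(sample))  # Counter({'hipErrorNoDevice': 2})
```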
I tried adding my current user to the render
and video
groups and rebooting WSL, but the situation is unchanged (my username is sam):
$ getent group video
video:x:44:sam
$ getent group render
video:x:44:sam
Does anyone have any thoughts on this? :)
Operating System
Ubuntu 24.04.1 LTS (Noble Numbat) (WSL2)
wsl -l -v
NAME STATE VERSION
* Ubuntu-24.04 Running 2
CPU
13th Gen Intel(R) Core(TM) i5-13600K
GPU
AMD Radeon RX 7900 GRE
ROCm Version
ROCm 6.2.3
ROCm Component
ROCm
Steps to Reproduce
- Install the amdgpu installer package:
sudo apt install ./amdgpu-install_6.2.60203-1_all.deb
- Install ROCm:
amdgpu-install -y --usecase=wsl,rocm --no-dkms
- Install PyTorch:
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm6.2
- Check the installation:
AMD_LOG_LEVEL=7 python3 -c 'import torch; print(torch.cuda.is_available())'
(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support
WSL environment detected.
=====================
HSA System Attributes
=====================
Runtime Version: 1.1
Runtime Ext Version: 1.6
System Timestamp Freq.: 1000.000000MHz
Sig. Max Wait Duration: 18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count)
Machine Model: LARGE
System Endianness: LITTLE
Mwaitx: DISABLED
DMAbuf Support: NO
==========
HSA Agents
==========
*******
Agent 1
*******
Name: CPU
Uuid: CPU-XX
Marketing Name: CPU
Vendor Name: CPU
Feature: None specified
Profile: FULL_PROFILE
Float Round Mode: NEAR
Max Queue Number: 0(0x0)
Queue Min Size: 0(0x0)
Queue Max Size: 0(0x0)
Queue Type: MULTI
Node: 0
Device Type: CPU
Cache Info:
Chip ID: 0(0x0)
Cacheline Size: 64(0x40)
Internal Node ID: 0
Compute Unit: 20
SIMDs per CU: 0
Shader Engines: 0
Shader Arrs. per Eng.: 0
Memory Properties:
Features: None
Pool Info:
Pool 1
Segment: GLOBAL; FLAGS: KERNARG, FINE GRAINED
Size: 32758112(0x1f3d960) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
Pool 2
Segment: GLOBAL; FLAGS: COARSE GRAINED
Size: 32758112(0x1f3d960) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
ISA Info:
*******
Agent 2
*******
Name: gfx1100
Marketing Name: AMD Radeon RX 7900 GRE
Vendor Name: AMD
Feature: KERNEL_DISPATCH
Profile: BASE_PROFILE
Float Round Mode: NEAR
Max Queue Number: 16(0x10)
Queue Min Size: 4096(0x1000)
Queue Max Size: 131072(0x20000)
Queue Type: MULTI
Node: 1
Device Type: GPU
Cache Info:
L1: 32(0x20) KB
L2: 6144(0x1800) KB
L3: 65536(0x10000) KB
Chip ID: 29772(0x744c)
Cacheline Size: 64(0x40)
Max Clock Freq. (MHz): 2052
Internal Node ID: 1
Compute Unit: 80
SIMDs per CU: 2
Shader Engines: 6
Shader Arrs. per Eng.: 2
Coherent Host Access: FALSE
Memory Properties:
Features: KERNEL_DISPATCH
Fast F16 Operation: TRUE
Wavefront Size: 32(0x20)
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension:
x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Max Waves Per CU: 32(0x20)
Max Work-item Per CU: 1024(0x400)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension:
x 4294967295(0xffffffff)
y 4294967295(0xffffffff)
z 4294967295(0xffffffff)
Max fbarriers/Workgrp: 32
Packet Processor uCode:: 2280
SDMA engine uCode:: 21
IOMMU Support:: None
Pool Info:
Pool 1
Segment: GLOBAL; FLAGS: COARSE GRAINED
Size: 16691368(0xfeb0a8) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:2048KB
Alloc Alignment: 4KB
Accessible by all: FALSE
Pool 2
Segment: GROUP
Size: 64(0x40) KB
Allocatable: FALSE
Alloc Granule: 0KB
Alloc Recommended Granule:0KB
Alloc Alignment: 0KB
Accessible by all: FALSE
ISA Info:
ISA 1
Name: amdgcn-amd-amdhsa--gfx1100
Machine Models: HSA_MACHINE_MODEL_LARGE
Profiles: HSA_PROFILE_BASE
Default Rounding Mode: NEAR
Default Rounding Mode: NEAR
Fast f16: TRUE
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension:
x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension:
x 4294967295(0xffffffff)
y 4294967295(0xffffffff)
z 4294967295(0xffffffff)
FBarrier Max Size: 32
*** Done ***
Additional Information
No response
It seems you are running ROCm through WSL; can you share your Windows driver version as well?
Certainly.
The current Windows driver version is 24.12.1
Hi.
I think I missed this part of the documentation.
Previously I installed PyTorch with the command below. It pulls the standard ROCm wheels, which do not work under WSL:
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm6.2
The correct way to install is:
wget https://repo.radeon.com/rocm/manylinux/rocm-rel-6.2.3/torch-2.3.0%2Brocm6.2.3-cp310-cp310-linux_x86_64.whl
wget https://repo.radeon.com/rocm/manylinux/rocm-rel-6.2.3/torchvision-0.18.0%2Brocm6.2.3-cp310-cp310-linux_x86_64.whl
wget https://repo.radeon.com/rocm/manylinux/rocm-rel-6.2.3/pytorch_triton_rocm-2.3.0%2Brocm6.2.3.5a02332983-cp310-cp310-linux_x86_64.whl
pip3 uninstall torch torchvision pytorch-triton-rocm
pip3 install torch-2.3.0+rocm6.2.3-cp310-cp310-linux_x86_64.whl torchvision-0.18.0+rocm6.2.3-cp310-cp310-linux_x86_64.whl pytorch_triton_rocm-2.3.0+rocm6.2.3.5a02332983-cp310-cp310-linux_x86_64.whl
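Note that the wheels above are tagged cp310 (CPython 3.10). As a quick sanity check before installing, you can compare a wheel's Python tag against the running interpreter; this is a simplified sketch of the filename convention, not the full packaging rules:

```python
import sys

def wheel_matches_interpreter(wheel_name):
    """Return True if the wheel's cpXY Python tag matches the current CPython."""
    # Wheel filenames end in: -<python tag>-<abi tag>-<platform tag>.whl
    parts = wheel_name.rsplit("-", 3)
    if len(parts) < 4:
        return False
    py_tag = parts[1]  # e.g. 'cp310'
    current = f"cp{sys.version_info.major}{sys.version_info.minor}"
    return py_tag == current
```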
# Replace the HSA runtime bundled with the wheel by the WSL-compatible one from the system ROCm install
location=`pip show torch | grep Location | awk -F ": " '{print $2}'`
cd ${location}/torch/lib/
rm libhsa-runtime64.so*
cp /opt/rocm/lib/libhsa-runtime64.so.1.2 libhsa-runtime64.so
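After the swap, a minimal check (assuming the ROCm wheels above are installed) is whether torch imports and reports a HIP build; torch.version.hip is None on non-ROCm builds of PyTorch:

```python
import importlib.util

def torch_rocm_status():
    """Return (installed, is_rocm_build) as a small post-install sanity probe."""
    if importlib.util.find_spec("torch") is None:
        return (False, False)
    import torch  # only imported if actually installed
    return (True, torch.version.hip is not None)

installed, is_rocm = torch_rocm_status()
print(installed, is_rocm)
```

With the correct wheels in place, AMD_LOG_LEVEL=7 python3 -c 'import torch; print(torch.cuda.is_available())' should now print True.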