clGetDeviceIDs() returns CL_DEVICE_NOT_FOUND after upgrade from 2.10 to 3.1
SailingDreams opened this issue · 1 comment
I recently upgraded ROCm from 2.10 to 3.1:
$ apt show rocm-libs -a
Package: rocm-libs
Version: 3.1.44
and unfortunately clinfo says it cannot find the device:
Number of platforms: 1
Platform Profile: FULL_PROFILE
Platform Version: OpenCL 2.1 AMD-APP (3084.0)
Platform Name: AMD Accelerated Parallel Processing
Platform Vendor: Advanced Micro Devices, Inc.
Platform Extensions: cl_khr_icd cl_amd_event_callback cl_amd_offline_devices
Platform Name: AMD Accelerated Parallel Processing
ERROR: clGetDeviceIDs(-1)
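The -1 printed by clinfo is the raw OpenCL status code. In CL/cl.h that value is CL_DEVICE_NOT_FOUND, which is where the name in the title comes from. A minimal sketch of the mapping follows; the cl_error_name helper is hypothetical and only covers a few codes, while the numeric values are the ones defined in the OpenCL headers.

```shell
#!/bin/sh
# Hypothetical helper: translate a few OpenCL status codes (values from
# CL/cl.h) into their symbolic names. Not part of clinfo or ROCm.
cl_error_name() {
    case "$1" in
        0)  echo "CL_SUCCESS" ;;
        -1) echo "CL_DEVICE_NOT_FOUND" ;;
        -2) echo "CL_DEVICE_NOT_AVAILABLE" ;;
        -3) echo "CL_COMPILER_NOT_AVAILABLE" ;;
        *)  echo "UNKNOWN($1)" ;;
    esac
}

# The code reported above by clGetDeviceIDs
cl_error_name -1
```

So the platform itself loads, but the runtime reports no usable device behind it.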
I tried returning to 2.10, following the instructions in another thread ("2.10 to 3.0 upgrade").
However, after the downgrade, clinfo crashes with a segmentation fault:
Thread 1 "clinfo" received signal SIGSEGV, Segmentation fault.
0x00007ffff6ded8b0 in ?? () from /opt/rocm/opencl/lib/x86_64/libamdocl64.so
So, giving up on the 2.10 approach, I removed all of ROCm
$ sudo apt autoremove rocm-dkms
and followed the 3.1 installation instructions and rebooted several times.
rocminfo appears to see the CPUs (output attached as rocminfo.txt):
$ /opt/rocm/bin/rocminfo
but clinfo still cannot see my device.
So I'm stuck at the moment without a system that can run OpenCL. Any thoughts on what to try next would be greatly appreciated.
CPU
$ uname -a
Linux ripper 5.3.0-42-generic #34~18.04.1-Ubuntu SMP Fri Feb 28 13:42:26 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
GPU
$ sudo lshw -C display
*-display
description: VGA compatible controller
product: Vega 10 XT [Radeon RX Vega 64]
vendor: Advanced Micro Devices, Inc. [AMD/ATI]
physical id: 0
bus info: pci@0000:45:00.0
version: c1
width: 64 bits
clock: 33MHz
capabilities: pm pciexpress msi vga_controller bus_master cap_list rom
configuration: driver=amdgpu latency=0
resources: iomemory:4800-47ff iomemory:47f0-47ef irq:105 memory:48000000000-481ffffffff memory:47f00000000-47f001fffff ioport:8000(size=256) memory:92300000-9237ffff memory:92380000-9239ffff
After working on this all day yesterday, this morning I uninstalled ROCm
$ sudo apt autoremove rocm-dkms rock-dkms
and tried amdgpu-pro. To my surprise, clinfo now finds my devices! (Output attached as clinfo_works.txt.)
It would have been nice to figure out what was wrong with ROCm 3.1, but I've got a deliverable due this week.
I noticed that the autoremove didn't delete /opt/rocm-3.1.0. I assume it's OK to delete it manually, but I'm so shell-shocked by this experience that I'm reluctant to do so.
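Before deleting a leftover directory by hand, one cautious option is to ask dpkg whether any installed package still claims it. The sketch below assumes a Debian/Ubuntu system with dpkg available; the report_rocm_leftovers function name is hypothetical, and the script only reports and never deletes anything.

```shell
#!/bin/sh
# Hypothetical helper: list rocm* directories under a prefix (default /opt)
# and say whether dpkg still records an owning package for each one.
report_rocm_leftovers() {
    prefix="${1:-/opt}"
    for d in "$prefix"/rocm*; do
        # Skip the unexpanded glob and anything that is not a directory
        [ -d "$d" ] || continue
        if command -v dpkg >/dev/null 2>&1 && dpkg -S "$d" >/dev/null 2>&1; then
            echo "$d: still owned by an installed package"
        else
            echo "$d: no owning package found"
        fi
    done
}

report_rocm_leftovers /opt
```

A directory reported with no owning package is most likely just files left behind after the packages were removed. Checking its contents before removing it is still sensible.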