ROCm/ROCm-OpenCL-Runtime

clGetDeviceIDs() returns CL_DEVICE_NOT_FOUND after upgrade from 2.10 to 3.1

SailingDreams opened this issue · 1 comments

I recently upgraded from 2.10 to 3.1

$ apt show rocm-libs -a
Package: rocm-libs
Version: 3.1.44 

and unfortunately clinfo says it cannot find the device:

Number of platforms:				 1
  Platform Profile:				 FULL_PROFILE
  Platform Version:				 OpenCL 2.1 AMD-APP (3084.0)
  Platform Name:				 AMD Accelerated Parallel Processing
  Platform Vendor:				 Advanced Micro Devices, Inc.
  Platform Extensions:				 cl_khr_icd cl_amd_event_callback cl_amd_offline_devices 


  Platform Name:				 AMD Accelerated Parallel Processing
ERROR: clGetDeviceIDs(-1)

I tried returning to 2.10, following the instructions in another thread (2.10 to 3.0 upgrade).
However, after returning to 2.10, clinfo crashes with a segmentation fault:

Thread 1 "clinfo" received signal SIGSEGV, Segmentation fault.
0x00007ffff6ded8b0 in ?? () from /opt/rocm/opencl/lib/x86_64/libamdocl64.so

Giving up on the 2.10 approach, I removed all of ROCm:

$ sudo apt autoremove rocm-dkms

I then followed the 3.1 installation instructions and rebooted several times.

rocminfo appears to see the CPUs (output of /opt/rocm/bin/rocminfo attached as rocminfo.txt), but clinfo cannot see my device.

So at the moment I'm stuck without a system that can run OpenCL. Any thoughts on what to try next would be greatly appreciated.
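One thing that may be worth checking when clinfo lists the platform but no devices is whether the current user can actually open the compute device node. The following is only a minimal sketch, assuming that ROCm exposes the GPU through /dev/kfd and that access is granted through membership in the video group (as the ROCm 3.x installation instructions describe). The helper names check_node and in_group are made up for this example.

```shell
#!/bin/sh
# Hypothetical helper: report whether a device node exists and is
# readable and writable by the current user.
check_node() {
    node="$1"
    if [ ! -e "$node" ]; then
        echo "missing"
    elif [ -r "$node" ] && [ -w "$node" ]; then
        echo "ok"
    else
        echo "no-access"
    fi
}

# Hypothetical helper: succeed if the current user belongs to the given group.
in_group() {
    id -nG | tr ' ' '\n' | grep -qx "$1"
}

# Assumption: /dev/kfd is the ROCm compute device node on this system.
echo "/dev/kfd: $(check_node /dev/kfd)"

# Assumption: access to the node is granted through the video group.
if in_group video; then
    echo "user is in the video group"
else
    echo "user is NOT in the video group"
fi
```

If the node is reported as missing, the kernel driver side is the likelier suspect. If it exists but is not accessible, group membership is the first thing to look at.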

CPU

$ uname -a
Linux ripper 5.3.0-42-generic #34~18.04.1-Ubuntu SMP Fri Feb 28 13:42:26 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

GPU

$ sudo lshw -C display
  *-display                 
       description: VGA compatible controller
       product: Vega 10 XT [Radeon RX Vega 64]
       vendor: Advanced Micro Devices, Inc. [AMD/ATI]
       physical id: 0
       bus info: pci@0000:45:00.0
       version: c1
       width: 64 bits
       clock: 33MHz
       capabilities: pm pciexpress msi vga_controller bus_master cap_list rom
       configuration: driver=amdgpu latency=0
       resources: iomemory:4800-47ff iomemory:47f0-47ef irq:105 memory:48000000000-481ffffffff memory:47f00000000-47f001fffff ioport:8000(size=256) memory:92300000-9237ffff memory:92380000-9239ffff

After working on this all day yesterday, this morning I uninstalled ROCm:

$ sudo apt autoremove rocm-dkms rock-dkms

I then tried amdgpu-pro instead. To my surprise, clinfo now finds my devices!
clinfo_works.txt

It would have been nice to figure out what was wrong with rocm 3.1, but I've got a deliverable due this week.

I noticed that the autoremove didn't delete /opt/rocm-3.1.0. I assume it's OK to delete it manually, but I'm so shell-shocked by this experience that I'm reluctant to do so.
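Before deleting a leftover directory like this by hand, one cautious step is to ask dpkg whether any installed package still claims it. This is a rough sketch rather than a definitive procedure: the function name owner_status is hypothetical, and it assumes a Debian/Ubuntu system where dpkg -S searches the installed-package file lists.

```shell
#!/bin/sh
# Hypothetical helper: print "owned" if an installed package claims the
# directory or entries under it, "unowned" if none does, or "unknown"
# if dpkg is not available on this system.
owner_status() {
    dir="$1"
    if ! command -v dpkg >/dev/null 2>&1; then
        echo "unknown"
    elif dpkg -S "$dir" >/dev/null 2>&1 || dpkg -S "$dir/*" >/dev/null 2>&1; then
        echo "owned"
    else
        echo "unowned"
    fi
}

echo "/opt/rocm-3.1.0: $(owner_status /opt/rocm-3.1.0)"
```

A result of "unowned" suggests the directory contains only files that no remaining package tracks, so removing it should not confuse the package manager. A result of "owned" means some package was not removed by the autoremove.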