ROCm/ROCm-OpenCL-Runtime

gfx1030 does not show up as OpenCL device

FluxusMagna opened this issue · 6 comments

The card shows up in lshw and rocminfo, but clinfo reports 0 devices for the AMD platform. I previously discussed this (rocm-arch/rocm-arch#768) with the Arch Linux package maintainers (@acxz), and we concluded that the problem seems to involve upstream code.

clinfo output:

$ /opt/rocm/bin/clinfo
Number of platforms:				 3
  Platform Profile:				 FULL_PROFILE
  Platform Version:				 OpenCL 3.0 CUDA 11.6.127
  Platform Name:				 NVIDIA CUDA
  Platform Vendor:				 NVIDIA Corporation
  Platform Extensions:				 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_fp64 cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_icd cl_khr_gl_sharing cl_nv_compiler_options cl_nv_device_attribute_query cl_nv_pragma_unroll cl_nv_copy_opts cl_khr_gl_event cl_nv_create_buffer cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_nv_kernel_attribute cl_khr_device_uuid cl_khr_pci_bus_info cl_khr_external_semaphore cl_khr_external_memory cl_khr_external_semaphore_opaque_fd cl_khr_external_memory_opaque_fd
  Platform Profile:				 FULL_PROFILE
  Platform Version:				 OpenCL 2.1 LINUX
  Platform Name:				 Intel(R) CPU Runtime for OpenCL(TM) Applications
  Platform Vendor:				 Intel(R) Corporation
  Platform Extensions:				 cl_khr_icd cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_byte_addressable_store cl_khr_depth_images cl_khr_3d_image_writes cl_intel_exec_by_local_thread cl_khr_spir cl_khr_fp64 cl_khr_image2d_from_buffer cl_intel_vec_len_hint
  Platform Profile:				 FULL_PROFILE
  Platform Version:				 OpenCL 2.1 AMD-APP (3423.0)
  Platform Name:				 AMD Accelerated Parallel Processing
  Platform Vendor:				 Advanced Micro Devices, Inc.
  Platform Extensions:				 cl_khr_icd cl_amd_event_callback


  Platform Name:				 NVIDIA CUDA
Number of devices:				 1
  Device Type:					 CL_DEVICE_TYPE_GPU
  Vendor ID:					 10deh
  Max compute units:				 13
  Max work items dimensions:			 3
    Max work items[0]:				 1024
    Max work items[1]:				 1024
    Max work items[2]:				 64
  Max work group size:				 1024
  Preferred vector width char:			 1
  Preferred vector width short:			 1
  Preferred vector width int:			 1
  Preferred vector width long:			 1
  Preferred vector width float:			 1
  Preferred vector width double:		 1
  Native vector width char:			 1
  Native vector width short:			 1
  Native vector width int:			 1
  Native vector width long:			 1
  Native vector width float:			 1
  Native vector width double:			 1
  Max clock frequency:				 772Mhz
  Address bits:					 64
  Max memory allocation:			 2128281600
  Image support:				 Yes
  Max number of images read arguments:		 256
  Max number of images write arguments:		 16
  Max image 2D width:				 16384
  Max image 2D height:				 16384
  Max image 3D width:				 4096
  Max image 3D height:				 4096
  Max image 3D depth:				 4096
  Max samplers within kernel:			 32
  Max size of kernel argument:			 4352
  Alignment (bits) of base address:		 4096
  Minimum alignment (bytes) for any datatype:	 128
  Single precision floating point capability
    Denorms:					 Yes
    Quiet NaNs:					 Yes
    Round to nearest even:			 Yes
    Round to zero:				 Yes
    Round to +ve and infinity:			 Yes
    IEEE754-2008 fused multiply-add:		 Yes
  Cache type:					 Read/Write
  Cache line size:				 128
  Cache size:					 638976
  Global memory size:				 8513126400
  Constant buffer size:				 65536
  Max number of constant args:			 9
  Local memory type:				 Scratchpad
  Local memory size:				 49152
  Max pipe arguments:				 0
  Max pipe active reservations:			 0
  Max pipe packet size:				 0
  Max global variable size:			 0
  Max global variable preferred total size:	 0
  Max read/write image args:			 0
  Max on device events:				 0
  Queue on device max size:			 0
  Max on device queues:				 0
  Queue on device preferred size:		 0
  SVM capabilities:				
    Coarse grain buffer:			 Yes
    Fine grain buffer:				 No
    Fine grain system:				 No
    Atomics:					 No
  Preferred platform atomic alignment:		 0
  Preferred global atomic alignment:		 0
  Preferred local atomic alignment:		 0
  Kernel Preferred work group size multiple:	 32
  Error correction support:			 0
  Unified memory for Host and Device:		 0
  Profiling timer resolution:			 1000
  Device endianess:				 Little
  Available:					 Yes
  Compiler available:				 Yes
  Execution capabilities:				
    Execute OpenCL kernels:			 Yes
    Execute native function:			 No
  Queue on Host properties:				
    Out-of-Order:				 Yes
    Profiling :					 Yes
  Queue on Device properties:				
    Out-of-Order:				 No
    Profiling :					 No
  Platform ID:					 0x562e4f9968c0
  Name:						 Quadro M4000
  Vendor:					 NVIDIA Corporation
  Device OpenCL C version:			 OpenCL C 1.2
  Driver version:				 510.60.02
  Profile:					 FULL_PROFILE
  Version:					 OpenCL 3.0 CUDA
  Extensions:					 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_fp64 cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_icd cl_khr_gl_sharing cl_nv_compiler_options cl_nv_device_attribute_query cl_nv_pragma_unroll cl_nv_copy_opts cl_khr_gl_event cl_nv_create_buffer cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_nv_kernel_attribute cl_khr_device_uuid cl_khr_pci_bus_info cl_khr_external_semaphore cl_khr_external_memory cl_khr_external_semaphore_opaque_fd cl_khr_external_memory_opaque_fd


  Platform Name:				 Intel(R) CPU Runtime for OpenCL(TM) Applications
Number of devices:				 1
  Device Type:					 CL_DEVICE_TYPE_CPU
  Vendor ID:					 8086h
  Max compute units:				 80
  Max work items dimensions:			 3
    Max work items[0]:				 8192
    Max work items[1]:				 8192
    Max work items[2]:				 8192
  Max work group size:				 8192
  Preferred vector width char:			 1
  Preferred vector width short:			 1
  Preferred vector width int:			 1
  Preferred vector width long:			 1
  Preferred vector width float:			 1
  Preferred vector width double:		 1
  Native vector width char:			 32
  Native vector width short:			 16
  Native vector width int:			 8
  Native vector width long:			 4
  Native vector width float:			 8
  Native vector width double:			 4
  Max clock frequency:				 2200Mhz
  Address bits:					 64
  Max memory allocation:			 16858395648
  Image support:				 Yes
  Max number of images read arguments:		 480
  Max number of images write arguments:		 480
  Max image 2D width:				 16384
  Max image 2D height:				 16384
  Max image 3D width:				 2048
  Max image 3D height:				 2048
  Max image 3D depth:				 2048
  Max samplers within kernel:			 480
  Max size of kernel argument:			 3840
  Alignment (bits) of base address:		 1024
  Minimum alignment (bytes) for any datatype:	 128
  Single precision floating point capability
    Denorms:					 Yes
    Quiet NaNs:					 Yes
    Round to nearest even:			 Yes
    Round to zero:				 No
    Round to +ve and infinity:			 No
    IEEE754-2008 fused multiply-add:		 No
  Cache type:					 Read/Write
  Cache line size:				 64
  Cache size:					 262144
  Global memory size:				 67433582592
  Constant buffer size:				 131072
  Max number of constant args:			 480
  Local memory type:				 Global
  Local memory size:				 32768
  Max pipe arguments:				 16
  Max pipe active reservations:			 3276
  Max pipe packet size:				 1024
  Max global variable size:			 65536
  Max global variable preferred total size:	 65536
  Max read/write image args:			 480
  Max on device events:				 4294967295
  Queue on device max size:			 4294967295
  Max on device queues:				 4294967295
  Queue on device preferred size:		 4294967295
  SVM capabilities:				
    Coarse grain buffer:			 Yes
    Fine grain buffer:				 Yes
    Fine grain system:				 Yes
    Atomics:					 Yes
  Preferred platform atomic alignment:		 64
  Preferred global atomic alignment:		 64
  Preferred local atomic alignment:		 0
  Kernel Preferred work group size multiple:	 128
  Error correction support:			 0
  Unified memory for Host and Device:		 1
  Profiling timer resolution:			 1
  Device endianess:				 Little
  Available:					 Yes
  Compiler available:				 Yes
  Execution capabilities:				
    Execute OpenCL kernels:			 Yes
    Execute native function:			 Yes
  Queue on Host properties:				
    Out-of-Order:				 Yes
    Profiling :					 Yes
  Queue on Device properties:				
    Out-of-Order:				 Yes
    Profiling :					 Yes
  Platform ID:					 0x562e4f9661e0
  Name:						 Intel(R) Xeon(R) CPU E5-2698 v4 @ 2.20GHz
  Vendor:					 Intel(R) Corporation
  Device OpenCL C version:			 OpenCL C 2.0
  Driver version:				 18.1.0.0920
  Profile:					 FULL_PROFILE
  Version:					 OpenCL 2.1 (Build 0)
  Extensions:					 cl_khr_icd cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_byte_addressable_store cl_khr_depth_images cl_khr_3d_image_writes cl_intel_exec_by_local_thread cl_khr_spir cl_khr_fp64 cl_khr_image2d_from_buffer cl_intel_vec_len_hint


  Platform Name:				 AMD Accelerated Parallel Processing
Number of devices:				 0

rocminfo output:

$ rocminfo
ROCk module is loaded
=====================
HSA System Attributes
=====================
Runtime Version:         1.1
System Timestamp Freq.:  1000.000000MHz
Sig. Max Wait Duration:  18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count)
Machine Model:           LARGE
System Endianness:       LITTLE

==========
HSA Agents
==========
*******
Agent 1
*******
  Name:                    Intel(R) Xeon(R) CPU E5-2698 v4 @ 2.20GHz
  Uuid:                    CPU-XX
  Marketing Name:          Intel(R) Xeon(R) CPU E5-2698 v4 @ 2.20GHz
  Vendor Name:             CPU
  Feature:                 None specified
  Profile:                 FULL_PROFILE
  Float Round Mode:        NEAR
  Max Queue Number:        0(0x0)
  Queue Min Size:          0(0x0)
  Queue Max Size:          0(0x0)
  Queue Type:              MULTI
  Node:                    0
  Device Type:             CPU
  Cache Info:
    L1:                      32768(0x8000) KB
  Chip ID:                 0(0x0)
  Cacheline Size:          64(0x40)
  Max Clock Freq. (MHz):   3600
  BDFID:                   0
  Internal Node ID:        0
  Compute Unit:            40
  SIMDs per CU:            0
  Shader Engines:          0
  Shader Arrs. per Eng.:   0
  WatchPts on Addr. Ranges:1
  Features:                None
  Pool Info:
    Pool 1
      Segment:                 GLOBAL; FLAGS: FINE GRAINED
      Size:                    32838116(0x1f511e4) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Alignment:         4KB
      Accessible by all:       TRUE
    Pool 2
      Segment:                 GLOBAL; FLAGS: KERNARG, FINE GRAINED
      Size:                    32838116(0x1f511e4) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Alignment:         4KB
      Accessible by all:       TRUE
    Pool 3
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED
      Size:                    32838116(0x1f511e4) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Alignment:         4KB
      Accessible by all:       TRUE
  ISA Info:
*******
Agent 2
*******
  Name:                    Intel(R) Xeon(R) CPU E5-2698 v4 @ 2.20GHz
  Uuid:                    CPU-XX
  Marketing Name:          Intel(R) Xeon(R) CPU E5-2698 v4 @ 2.20GHz
  Vendor Name:             CPU
  Feature:                 None specified
  Profile:                 FULL_PROFILE
  Float Round Mode:        NEAR
  Max Queue Number:        0(0x0)
  Queue Min Size:          0(0x0)
  Queue Max Size:          0(0x0)
  Queue Type:              MULTI
  Node:                    1
  Device Type:             CPU
  Cache Info:
    L1:                      32768(0x8000) KB
  Chip ID:                 0(0x0)
  Cacheline Size:          64(0x40)
  Max Clock Freq. (MHz):   3600
  BDFID:                   0
  Internal Node ID:        1
  Compute Unit:            40
  SIMDs per CU:            0
  Shader Engines:          0
  Shader Arrs. per Eng.:   0
  WatchPts on Addr. Ranges:1
  Features:                None
  Pool Info:
    Pool 1
      Segment:                 GLOBAL; FLAGS: FINE GRAINED
      Size:                    33014992(0x1f7c4d0) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Alignment:         4KB
      Accessible by all:       TRUE
    Pool 2
      Segment:                 GLOBAL; FLAGS: KERNARG, FINE GRAINED
      Size:                    33014992(0x1f7c4d0) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Alignment:         4KB
      Accessible by all:       TRUE
    Pool 3
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED
      Size:                    33014992(0x1f7c4d0) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Alignment:         4KB
      Accessible by all:       TRUE
  ISA Info:
*******
Agent 3
*******
  Name:                    gfx1030
  Uuid:                    GPU-XX
  Marketing Name:          AMD Radeon RX 6800 XT
  Vendor Name:             AMD
  Feature:                 KERNEL_DISPATCH
  Profile:                 BASE_PROFILE
  Float Round Mode:        NEAR
  Max Queue Number:        128(0x80)
  Queue Min Size:          64(0x40)
  Queue Max Size:          131072(0x20000)
  Queue Type:              MULTI
  Node:                    2
  Device Type:             GPU
  Cache Info:
    L1:                      16(0x10) KB
    L2:                      4096(0x1000) KB
    L3:                      131072(0x20000) KB
  Chip ID:                 29631(0x73bf)
  Cacheline Size:          64(0x40)
  Max Clock Freq. (MHz):   2575
  BDFID:                   1280
  Internal Node ID:        2
  Compute Unit:            72
  SIMDs per CU:            2
  Shader Engines:          8
  Shader Arrs. per Eng.:   2
  WatchPts on Addr. Ranges:4
  Features:                KERNEL_DISPATCH
  Fast F16 Operation:      TRUE
  Wavefront Size:          32(0x20)
  Workgroup Max Size:      1024(0x400)
  Workgroup Max Size per Dimension:
    x                        1024(0x400)
    y                        1024(0x400)
    z                        1024(0x400)
  Max Waves Per CU:        32(0x20)
  Max Work-item Per CU:    1024(0x400)
  Grid Max Size:           4294967295(0xffffffff)
  Grid Max Size per Dimension:
    x                        4294967295(0xffffffff)
    y                        4294967295(0xffffffff)
    z                        4294967295(0xffffffff)
  Max fbarriers/Workgrp:   32
  Pool Info:
    Pool 1
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED
      Size:                    16760832(0xffc000) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Alignment:         4KB
      Accessible by all:       FALSE
    Pool 2
      Segment:                 GROUP
      Size:                    64(0x40) KB
      Allocatable:             FALSE
      Alloc Granule:           0KB
      Alloc Alignment:         0KB
      Accessible by all:       FALSE
  ISA Info:
    ISA 1
      Name:                    amdgcn-amd-amdhsa--gfx1030
      Machine Models:          HSA_MACHINE_MODEL_LARGE
      Profiles:                HSA_PROFILE_BASE
      Default Rounding Mode:   NEAR
      Default Rounding Mode:   NEAR
      Fast f16:                TRUE
      Workgroup Max Size:      1024(0x400)
      Workgroup Max Size per Dimension:
        x                        1024(0x400)
        y                        1024(0x400)
        z                        1024(0x400)
      Grid Max Size:           4294967295(0xffffffff)
      Grid Max Size per Dimension:
        x                        4294967295(0xffffffff)
        y                        4294967295(0xffffffff)
        z                        4294967295(0xffffffff)
      FBarrier Max Size:       32
*** Done ***
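
The mismatch between the two listings (a GPU agent in rocminfo, zero devices in clinfo) can also be checked mechanically. The following is a minimal Python sketch that assumes the text layouts shown above; `devices_per_platform` and `gpu_agents` are hypothetical helper names, not part of any ROCm tool.

```python
import re


def devices_per_platform(clinfo_text):
    """Map each platform name to the device count clinfo printed for it."""
    counts = {}
    current = None
    for line in clinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue
        key, value = key.strip(), value.strip()
        if key == "Platform Name":
            # Remember the most recent platform header.
            current = value
        elif key == "Number of devices" and current is not None:
            counts[current] = int(value)
    return counts


def gpu_agents(rocminfo_text):
    """Return the names of HSA agents whose Device Type is GPU."""
    gpus = []
    name = None
    expecting_name = False
    for line in rocminfo_text.splitlines():
        stripped = line.strip()
        if re.fullmatch(r"Agent \d+", stripped):
            expecting_name = True
            name = None
        elif expecting_name and stripped.startswith("Name:"):
            # Only the first "Name:" after an agent header is the agent name;
            # later "Name:" lines belong to the ISA section.
            name = stripped.split(":", 1)[1].strip()
            expecting_name = False
        elif stripped.startswith("Device Type:") and name is not None:
            if stripped.split(":", 1)[1].strip() == "GPU":
                gpus.append(name)
            name = None
    return gpus
```

Fed the two outputs above, `devices_per_platform` should map "AMD Accelerated Parallel Processing" to 0 while `gpu_agents` lists gfx1030, which is the symptom reported here.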

The likely problem was traced back to

https://github.com/RadeonOpenCompute/ROCm-OpenCL-Runtime/blob/bbdc87e08b322d349f82bdd7575c8ce94d31d276/tools/clinfo/clinfo.cpp#L124

and then

https://github.com/RadeonOpenCompute/ROCm-OpenCL-Runtime/blob/bbdc87e08b322d349f82bdd7575c8ce94d31d276/tools/clinfo/clinfo.cpp#L115

Make sure you have the latest ROCm installation. gfx1030 support is enabled in the 5.1 branch (see https://github.com/ROCm-Developer-Tools/ROCclr/blob/rocm-5.1.x/device/device.cpp#L183).

I am using 5.1.1, so that should not be the issue.

acxz commented

@vsytch it would be helpful if you could point us to the logic of platform.getDevices to help us trace this down. Specifically, how (and where in the code) are the devices queried from the hardware?
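
For reference, at the public API level the OpenCL C++ wrapper's `getDevices` ends up in the C entry point `clGetDeviceIDs`. The sketch below issues the same query from Python through `ctypes`, so it can be run independently of clinfo. It only shows the API call, not the runtime's internal device enumeration, and `count_devices_per_platform` is a hypothetical helper name.

```python
import ctypes
import ctypes.util

CL_SUCCESS = 0
CL_PLATFORM_NAME = 0x0902
CL_DEVICE_TYPE_ALL = 0xFFFFFFFF


def count_devices_per_platform():
    """Return [(platform name, device count, status)], or None without a loader."""
    libname = ctypes.util.find_library("OpenCL")
    if libname is None:
        return None
    try:
        cl = ctypes.CDLL(libname)
    except OSError:
        return None

    uint_p = ctypes.POINTER(ctypes.c_uint)
    handle_p = ctypes.POINTER(ctypes.c_void_p)
    cl.clGetPlatformIDs.argtypes = [ctypes.c_uint, handle_p, uint_p]
    cl.clGetPlatformIDs.restype = ctypes.c_int
    cl.clGetPlatformInfo.argtypes = [
        ctypes.c_void_p, ctypes.c_uint, ctypes.c_size_t,
        ctypes.c_void_p, ctypes.POINTER(ctypes.c_size_t),
    ]
    cl.clGetPlatformInfo.restype = ctypes.c_int
    # cl_device_type is a 64-bit bitfield.
    cl.clGetDeviceIDs.argtypes = [
        ctypes.c_void_p, ctypes.c_uint64, ctypes.c_uint, handle_p, uint_p,
    ]
    cl.clGetDeviceIDs.restype = ctypes.c_int

    n_platforms = ctypes.c_uint(0)
    if cl.clGetPlatformIDs(0, None, ctypes.byref(n_platforms)) != CL_SUCCESS:
        return []
    if n_platforms.value == 0:
        return []
    platforms = (ctypes.c_void_p * n_platforms.value)()
    if cl.clGetPlatformIDs(n_platforms.value, platforms, None) != CL_SUCCESS:
        return []

    results = []
    for platform in platforms:
        # Names longer than the buffer are left empty in this sketch.
        name = ctypes.create_string_buffer(256)
        cl.clGetPlatformInfo(platform, CL_PLATFORM_NAME,
                             ctypes.sizeof(name), name, None)
        n_devices = ctypes.c_uint(0)
        # A status of -1 (CL_DEVICE_NOT_FOUND) means the platform has no devices.
        status = cl.clGetDeviceIDs(platform, CL_DEVICE_TYPE_ALL, 0, None,
                                   ctypes.byref(n_devices))
        count = n_devices.value if status == CL_SUCCESS else 0
        results.append((name.value.decode(errors="replace"), count, status))
    return results
```

If this reports the same zero count for the AMD platform, the problem sits in the runtime rather than in clinfo itself.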

Yeah I noticed this too. My Raven APU shows up, but I don't see the gfx1030.

Can you run:

AMD_LOG_LEVEL=4 clinfo
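
Because the log can be long, it may be easier to capture it for attaching to the issue. A minimal Python sketch, assuming the runtime log is written to stderr (both streams are captured in case it is not); `run_with_amd_log` is a hypothetical helper name.

```python
import os
import subprocess


def run_with_amd_log(cmd, level=4):
    """Run cmd with AMD_LOG_LEVEL set; return (exit code, stdout, stderr)."""
    # Copy the current environment and add the logging variable on top.
    env = dict(os.environ, AMD_LOG_LEVEL=str(level))
    proc = subprocess.run(cmd, env=env, capture_output=True, text=True,
                          check=False)
    return proc.returncode, proc.stdout, proc.stderr
```

For example, `run_with_amd_log(["/opt/rocm/bin/clinfo"])` would run the same binary as in the report above and return its output as strings that can be written to a file.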

I just realised that my install of the compiler stack (clang/comgr/llvm) was messed up: it was somehow trying to use comgr with an older clang/llvm, so it obviously failed. There might be an issue with the rocm-arch packages. By default comgr should link statically against clang, but it can also be linked dynamically, which is what I did.
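
One way to see which clang/LLVM libraries a dynamically linked comgr actually resolves to is to inspect its `ldd` output. A minimal Python sketch; the helper names are hypothetical, and the library path in the usage note is an assumption about the install layout.

```python
def linked_libraries(ldd_output):
    """Parse `ldd` output into {soname: resolved path, or None if not found}."""
    libs = {}
    for line in ldd_output.splitlines():
        line = line.strip()
        if "=>" not in line:
            # Skip entries such as linux-vdso or the dynamic loader itself.
            continue
        soname, _, target = line.partition("=>")
        target = target.strip()
        if target.startswith("not found"):
            path = None
        else:
            # Drop the trailing load address, e.g. " (0x00007f...)".
            path = target.split(" (")[0]
        libs[soname.strip()] = path
    return libs


def toolchain_libraries(libs, needles=("clang", "LLVM", "lld")):
    """Keep only entries whose soname looks like part of the clang/LLVM stack."""
    return {name: path for name, path in libs.items()
            if any(needle in name for needle in needles)}
```

Usage would be along the lines of passing the text from `ldd /opt/rocm/lib/libamd_comgr.so` (path assumed) to `linked_libraries` and then to `toolchain_libraries`. Entries resolving outside the ROCm prefix, or to `None`, would point to the kind of version mix-up described above.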

That sounds very plausible. I got it to work with repackaged Ubuntu packages, so it is likely something related to the Arch packages. I forgot to post that here, though. At the moment I have no quick way to test the hypothesis, as the machine is currently in use for relatively urgent work, but you can close the issue if you see fit.