intel/compute-runtime

NEO driver does not detect GPU when using kernel 6.8.x.

ionutnechita-intel opened this issue · 45 comments

The NEO driver does not detect the GPU when using kernel 6.8.x.

With kernels 6.5.x and 6.6.x, the GPU is present:

/opt/intel/oneapi/compiler/2024.0/bin/sycl-ls
[opencl:acc:0] Intel(R) FPGA Emulation Platform for OpenCL(TM), Intel(R) FPGA Emulation Device OpenCL 1.2  [2023.16.12.0.12_195853.xmain-hotfix]
[opencl:cpu:1] Intel(R) OpenCL, 11th Gen Intel(R) Core(TM) i7-1185GRE @ 2.80GHz OpenCL 3.0 (Build 0) [2023.16.12.0.12_195853.xmain-hotfix]
[opencl:acc:2] Intel(R) FPGA Emulation Platform for OpenCL(TM), Intel(R) FPGA Emulation Device OpenCL 1.2  [2023.16.12.0.12_195853.xmain-hotfix]
[opencl:gpu:3] Intel(R) OpenCL Graphics, Intel(R) Iris(R) Xe Graphics OpenCL 3.0 NEO  [24.05.28454.6]
[ext_oneapi_level_zero:gpu:0] Intel(R) Level-Zero, Intel(R) Iris(R) Xe Graphics 1.3 [1.3.28454]

On kernel 6.8.x, the GPU entries are missing:

/opt/intel/oneapi/compiler/2024.0/bin/sycl-ls
[opencl:acc:0] Intel(R) FPGA Emulation Platform for OpenCL(TM), Intel(R) FPGA Emulation Device OpenCL 1.2  [2023.16.12.0.12_195853.xmain-hotfix]
[opencl:cpu:1] Intel(R) OpenCL, 11th Gen Intel(R) Core(TM) i7-1185GRE @ 2.80GHz OpenCL 3.0 (Build 0) [2023.16.12.0.12_195853.xmain-hotfix]
[opencl:acc:2] Intel(R) FPGA Emulation Platform for OpenCL(TM), Intel(R) FPGA Emulation Device OpenCL 1.2  [2023.16.12.0.12_195853.xmain-hotfix]

I can reproduce this with the latest drm-tip 6.8.0-rc6 kernel, using an earlier-built (2024-02-09) compute-runtime master branch or earlier compute-runtime releases => neither clinfo nor zello_sysman recognizes the GPU. The vainfo / vpl-inspect media tools still recognize the GPU though, so it's a compute-stack-specific issue.

I do not see any difference in strace output (between old and new kernels) before compute-runtime decides to give up, so it's a bit of a mystery why it decides not to recognize the GPU.

Thank you for reproducing this.

On 6.7.x, GPU is recognized.
Only 6.8.x is not recognized.

Yes, it works with 6.7 (drm-tip) kernel also for me, just not with 6.8 (i915 KMD).

EDIT: that was with public Xe KMD repo, not drm-tip. With drm-tip, the issue is already with earlier kernel version (see below).

I tested with 6.8.0-rc1 (6.8.0-060800rc1-generic) and the issue reproduces.

So the issue probably appeared between 6.7 and 6.8.0-rc1.

I noticed several commits for the new Intel Xe driver and for eDP/DisplayPort fixes in 6.8.0-rc1.

I do not have time to bisect to find which commit or commits cause this behaviour.

Dang. I was comparing "drm-tip" on TGL against "xe-drm-next" kernel on DG1, but their i915 KMD codes seem to progress at different rates, so I had to do quick bisection using already existing nightly "drm-tip" builds...

While things work still with 6.7 version of "xe-drm-next" kernel repo, with the "drm-tip" repo kernel, clinfo & zello_sysman actually broke already earlier, somewhere between couple of "drm-tip" repo upstream 6.6-rc7 kernel integration changes:

  • drm-tip: 2023y-10m-29d-09h-52m-45s UTC integration manifest
  • drm-tip: 2023y-10m-31d-13h-47m-12s UTC integration manifest

(Commits named like those, or the original commits are not any more in "drm-tip" repo, as it gets constantly rebased to upstream, so I cannot provide list of commits between them any more.)

Hi folks,
we also observe an issue with the 6.8 kernel: i915 reports a different I915_CONTEXT_PARAM_GTT_SIZE. As a workaround, could you try running the application with these additional environment variables: NEOReadDebugKeys=1 OverrideGpuAddressSpace=48?

we also observe an issue with the 6.8 kernel: i915 reports a different I915_CONTEXT_PARAM_GTT_SIZE.

Media and 3D drivers seem to work fine with that change, why it's a problem for L0/compute stack?

(I'm wondering whether this change should be reported to upstream as kernel stable ABI breakage...)

Looking at the compute-runtime code, it seems to affect SVM capability & address space size:
https://github.com/intel/compute-runtime/blob/master/shared/source/os_interface/linux/product_helper_drm.cpp#L128

Whereas in Mesa code:
https://gitlab.freedesktop.org/mesa/mesa/-/blob/main/src/intel/vulkan/anv_device.c#L2300

As a workaround, could you try running the application with these additional environment variables: NEOReadDebugKeys=1 OverrideGpuAddressSpace=48?

Yes, with those both clinfo & zello_sysman work just fine (on TGL-H iGPU).

Hi @eero-t,

Using the latest drm-tip version with the variables set in the environment, the GPU appears.

# /opt/intel/oneapi/2024.0/bin/sycl-ls
[opencl:acc:0] Intel(R) FPGA Emulation Platform for OpenCL(TM), Intel(R) FPGA Emulation Device OpenCL 1.2  [2023.16.12.0.12_195853.xmain-hotfix]
[opencl:cpu:1] Intel(R) OpenCL, 11th Gen Intel(R) Core(TM) i7-1185GRE @ 2.80GHz OpenCL 3.0 (Build 0) [2023.16.12.0.12_195853.xmain-hotfix]
[opencl:acc:2] Intel(R) FPGA Emulation Platform for OpenCL(TM), Intel(R) FPGA Emulation Device OpenCL 1.2  [2023.16.12.0.12_195853.xmain-hotfix]
# NEOReadDebugKeys=1 OverrideGpuAddressSpace=48 /opt/intel/oneapi/2024.0/bin/sycl-ls
[opencl:acc:0] Intel(R) FPGA Emulation Platform for OpenCL(TM), Intel(R) FPGA Emulation Device OpenCL 1.2  [2023.16.12.0.12_195853.xmain-hotfix]
[opencl:gpu:1] Intel(R) OpenCL HD Graphics, Intel(R) Iris(R) Xe Graphics OpenCL 3.0 NEO  [23.13.026032]
[opencl:cpu:2] Intel(R) OpenCL, 11th Gen Intel(R) Core(TM) i7-1185GRE @ 2.80GHz OpenCL 3.0 (Build 0) [2023.16.12.0.12_195853.xmain-hotfix]
[opencl:acc:3] Intel(R) FPGA Emulation Platform for OpenCL(TM), Intel(R) FPGA Emulation Device OpenCL 1.2  [2023.16.12.0.12_195853.xmain-hotfix]
[ext_oneapi_level_zero:gpu:0] Intel(R) Level-Zero, Intel(R) Iris(R) Xe Graphics 1.3 [1.3.26032]
# uname -a
Linux 6.8.0-rc6-lowlatency1 #1 SMP PREEMPT_DYNAMIC Fri Mar  1 09:38:45 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
# lscpu | grep "Model name"
Model name:                         11th Gen Intel(R) Core(TM) i7-1185GRE @ 2.80GHz

In this case, does the issue come from the kernel or from the NEO driver/OpenCL?

Well, it depends on whether the GTT size value returned by the KMD is considered part of the stable ABI, but I do not see how it could be, as there can be different reasons for those values to differ. I would think that NEO should accept / adapt to sensible GTT size values, potentially with a warning when the value differs from what is expected, instead of barfing out when it does not exactly match its expectations.

Tested 6.8.0-rc3 based Xe KMD, and compute/Sysman driver worked with that, so this issue seems to be i915 KMD specific (as expected).

I can reproduce this on Arch with Linux 6.8 release (6.8.1-arch1-1) using i915.
Haven't tried xe yet.

Exporting these works fine:

export NEOReadDebugKeys=1
export OverrideGpuAddressSpace=48
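
For anyone who would rather not export the variables globally, here is a minimal Python sketch that applies the same two variables to a single child process only. The helper names (env_with_workaround, run_with_workaround) are made up for this illustration and are not part of any Intel tooling; the variable names and values are the ones quoted in this thread.

```python
import os
import shutil
import subprocess

# The two debug variables quoted in this thread, set together as suggested.
NEO_WORKAROUND = {"NEOReadDebugKeys": "1", "OverrideGpuAddressSpace": "48"}


def env_with_workaround(base=None):
    """Return a copy of the environment with the two workaround variables set."""
    env = dict(os.environ if base is None else base)
    env.update(NEO_WORKAROUND)
    return env


def run_with_workaround(cmd):
    """Run cmd (a list of strings) with the workaround applied.

    Returns None if the executable is not on PATH, otherwise the
    CompletedProcess with captured stdout/stderr.
    """
    if shutil.which(cmd[0]) is None:
        return None
    return subprocess.run(
        cmd, env=env_with_workaround(), capture_output=True, text=True
    )
```

A call such as run_with_workaround(["clinfo", "-l"]) would then list platforms and devices with the override active, while the rest of the session stays untouched.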

In this case, will the NEO compute driver be adapted to work with the new behaviour?

DX37 commented

Encountered this issue also.

On 6.8:

gpuAddressSpace = 281474976706559
= 111111111111111111111111111111111110111111111111

On 6.7:

gpuAddressSpace = 281474976710655
= 111111111111111111111111111111111111111111111111

The issue seems to lie here:

if (cpuVirtualAddressSize == 48 && gpuAddressSpace == maxNBitValue(48)) {
    gfxBase = maxNBitValue(48 - 1) + 1;
    heapInit(HeapIndex::heapSvm, 0ull, gfxBase);
} else if (gpuAddressSpace == maxNBitValue(47)) {
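
To make the mismatch concrete, here is a small Python sketch of the arithmetic. It is not the driver code: max_n_bit_value only mirrors the name used in the snippet above, and matches_exact_check is a simplification that looks at the two masks only (it ignores the cpuVirtualAddressSize condition). The two constants are the values reported above for 6.7 and 6.8.

```python
def max_n_bit_value(n):
    """Largest value representable in n bits, i.e. 2**n - 1."""
    return (1 << n) - 1


GTT_SIZE_6_7 = 281474976710655  # value reported above on kernel 6.7
GTT_SIZE_6_8 = 281474976706559  # value reported above on kernel 6.8


def matches_exact_check(gpu_address_space):
    """True if the value equals one of the two masks compared in the snippet."""
    return gpu_address_space in (max_n_bit_value(48), max_n_bit_value(47))


print(matches_exact_check(GTT_SIZE_6_7))   # True
print(matches_exact_check(GTT_SIZE_6_8))   # False
print(max_n_bit_value(48) - GTT_SIZE_6_8)  # 4096
```

The 6.8 value is exactly 4096 below the full 48-bit mask (the single cleared bit in the binary dump above), so neither equality comparison shown in the snippet can match.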

In this case, will the NEO compute driver be adapted to work with the new behaviour?

It seems that change in value reported by the GTT size ioctl() may be reverted in i915 kernel driver: https://patchwork.freedesktop.org/series/131095/

(I.e. KMD would only internally use the "usable" GTT size value, and report full address space to user space, including the reserved parts, and distros using 6.8.0 kernel need to patch their kernels until upstream releases updated kernel.)

@JablonskiMateusz Maybe compute-runtime could do some BAT tests also with latest drm-tip kernel, to catch such changes before they are sent to upstream kernel? This change was in drm-tip repo i915 KMD already in 6.7...

Note that the upcoming Ubuntu 24.04 LTS uses the non-LTS 6.8 kernel. Hopefully it can be fixed before it's released next month. Otherwise OpenCL will not be available on many distros based on it.

rusticl-mesa actually still works fine in my testing, even though intel-compute-runtime doesn't work at all

rusticl is still an experimental implementation and according to Mesa it is currently broken on Arc GPUs. My use case is video processing and only NEO supports zero-copy interop between VA-API and OpenCL through cl_intel_va_api_media_sharing.

Just adding as well that I'm also experiencing this issue on NixOS when running the latest kernel (6.8.1). The GPU (Intel N100, Alder Lake) does not show up in clinfo.

However, on a N5105 machine (Jasper Lake), the GPU did get detected by clinfo on the latest kernel.

However downgrading to 6.7.10 on the N100 machine immediately resolved the issue.

Good news folks, we are going to adjust the logic on UMD side so we can accept new gtt size reported by i915 ;)
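
As an illustration only of what "accepting the new GTT size" could mean, here is a hypothetical Python sketch that classifies the address space by the number of bits needed rather than by exact equality with a mask. This is an assumption made for discussion; it is not taken from the compute-runtime sources and may differ from what the actual change does. The function name address_space_bits is invented for this example.

```python
def address_space_bits(gtt_size):
    """Hypothetical helper: number of bits needed to represent gtt_size.

    A value slightly below 2**48 - 1 (for example with a reserved page
    subtracted) still needs 48 bits, so it would be treated as a 48-bit
    address space instead of being rejected.
    """
    return int(gtt_size).bit_length()


# Values reported earlier in this thread.
print(address_space_bits(281474976710655))  # 48 (kernel 6.7)
print(address_space_bits(281474976706559))  # 48 (kernel 6.8)
print(address_space_bits((1 << 47) - 1))    # 47
```

A check of this shape tolerates small reserved regions at the cost of no longer detecting an unexpected value, which is why a warning on mismatch, as suggested earlier in the thread, might still be worthwhile.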

This is good news.

could you retry with neo built with this commit 420e139?

I applied this commit on top of the version currently shipped by Arch Linux (23.48.27912.11) and it fixed the problem with my i5-7200U iGPU. Now clinfo is able to detect it, and I could successfully run some admittedly simple OpenCL programs on Linux 6.8.2 (without any extra environment variables).

FYI: @tjaalton Ubuntu 24.04 LTS also has a 6.8+ kernel, so its compute-runtime packages need this too.

Release: https://github.com/intel/compute-runtime/releases/tag/24.09.28717.12

Tested with: Ubuntu 24.04 Alpha. Linux Kernel 6.8.4-lowlatency. TGL: 11th Gen Intel(R) Core(TM) i7-1185GRE @ 2.80GHz

  • 6.8.5-lowlatency kernel version (the behaviour changes with this version).

New 6.8.5, 6.8.6 and 6.6.27 LTS kernels are unable to run using the GPU.
It detects and tries to run on the GPU but gets stuck with 100% single CPU core usage.
Happens on any OpenCL or SYCL app. (Kernel 6.8 is using the workaround provided in this thread.)

You can downgrade to Linux 6.8.4 for Arch Linux with these packages:
linux 6.8.4: https://archive.archlinux.org/packages/l/linux/linux-6.8.4.arch1-1-x86_64.pkg.tar.zst
linux-headers 6.8.4: https://archive.archlinux.org/packages/l/linux-headers/linux-headers-6.8.4.arch1-1-x86_64.pkg.tar.zst

New 6.8.5, 6.8.6 and 6.6.27 LTS kernels are unable to run using the GPU.

@Disty0 If issue happens also with 6.6 kernel, I do not think it to be related to this issue => please file a separate one, and report also compute-runtime version, and where perf reports CPU usage to happen (run as root):

# perf record -a
<wait a min or two>
^C
# perf report -n

Release: https://github.com/intel/compute-runtime/releases/tag/24.09.28717.12

Um, its release notes mention it still needing the env var workaround?

Slightly newer tag includes actual fix:
24.09.28717.12...24.09.28717.14
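
A quick way to compare an installed upstream version string against that tag is sketched below in Python. It assumes the four dot-separated fields compare numerically from left to right, and that 24.09.28717.14 is the first upstream tag with the fix, as stated above; both are assumptions for this sketch. Distro packages may carry the patch on an older base version (as the patched Arch build mentioned earlier did), so a False result here is only a hint.

```python
# First upstream tag said in this thread to include the actual fix.
FIRST_FIXED = (24, 9, 28717, 14)


def parse_version(text):
    """Split a version such as '24.09.28717.14' into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))


def has_gtt_fix(version):
    """True if an upstream version string is at or past the fixed tag."""
    return parse_version(version) >= FIRST_FIXED


print(has_gtt_fix("24.09.28717.12"))  # False
print(has_gtt_fix("24.09.28717.14"))  # True
```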

Right. I was trying to see why 24.09.28717.12 still didn't work for me and read your reply.
Thanks. This saved me time.

FYI: @tjaalton Ubuntu 24.04 LTS also has a 6.8+ kernel, so its compute-runtime packages need this too.

uploaded the fix to noble, thanks for the ping

This issue seems to be fixed with aur/intel-compute-runtime-bin 24.13.29138.7-1 on my end. (Arch Linux 6.8.4)

since issue seems to be fixed, can we now close the issue?

Hello @JablonskiMateusz ,

I think this issue is fixed now.

Maybe it is fine to close this ticket.

@ionutnechita-intel Sorry, but this doesn't work inside an OCI container with podman for whatever reason. Not sure if it is also an issue with Docker but I would presume it would be a problem as well. You have to export the two environment variables NEOReadDebugKeys=1 and OverrideGpuAddressSpace=48 for the GPU to be seen inside the container but not on the host machine. I don't know if you want to consider it the same bug but if not, I can open a new bug report for this.

@simonlui Are you sure that the version of the Intel Compute Runtime installed inside the container contains the fix? I can imagine your situation happening if this were not the case. For reference, my iGPU appears to be correctly detected by clinfo inside an Arch Linux-based container.

@joanbm Yeah that was it. I was confused why I was hitting this in the oneapi-basekit Docker image but it was last updated a month ago at the time of writing this so it makes sense why it still had the issue without the updated version of the runtime inside the container.

@JablonskiMateusz When will this fix be posted to the apt repo at https://repositories.intel.com/gpu/ubuntu?

Hi @simonlui,

I understand what you are saying, but it must be checked more thoroughly, with several OS variants as containers.

I tested it on Ubuntu 24.04, directly on the physical machine, with the latest update, and I didn't see the problem anymore.

@ionutnechita-intel The problem was fixed, it was an outdated compute runtime package inside the oneapi-basekit Docker image which didn't have the updated runtime installed by default. Updating the package manually fixed the issue.

Hi @simonlui,

Thank you for the feedback.

Have a good day.

I am having the same issue with Rocky Linux. After I upgraded from 9.2 to 9.4, I can no longer see the Arc GPU in clinfo. I see my Arc 750 in "lspci" but not in clinfo, and I cannot run code on it.
My username is part of the "render" group and I have the Redhat 9.3 driver installed (the latest one I could find) along with OneAPI HPC toolkit 2024.2.

If I use the two environment variables above, it works! (this is the first fix I have found).

Will this be fixed in the next driver release that supports RHEL 9.4?