intel/compute-runtime

clpeak and llama.cpp stuck at 100% CPU on 6.8.5 kernel

notsyncing opened this issue · 16 comments

As #710, @Disty0 writes:

New 6.8.5, 6.8.6 and 6.6.27 LTS kernels are unable to run using the GPU.
It detects and tries to run on the GPU but gets stuck with 100% single CPU core usage.
Happens on any OpenCL or SYCL app. (Kernel 6.8 is using the workaround provided in this thread.)

You can downgrade to Linux 6.8.4 for Arch Linux with these packages:
linux 6.8.4: https://archive.archlinux.org/packages/l/linux/linux-6.8.4.arch1-1-x86_64.pkg.tar.zst
linux-headers 6.8.4: https://archive.archlinux.org/packages/l/linux-headers/linux-headers-6.8.4.arch1-1-x86_64.pkg.tar.zst

This happens to me as well on both llama.cpp and clpeak.

clpeak output:

$ NEOReadDebugKeys=1 PrintDebugMessages=1 LogWaitingForCompletion=1 EventsDebugEnable=1 PrintKmdTimes=1 LogZEInfo=1 clpeak
WARNING: Failed to request OCL Turbo Boost
computeUnitsUsedForScratch: 4096
hwInfo: {512, 4096}: (16, 1, 32)

Platform: Intel(R) OpenCL Graphics
  Device: Intel(R) Arc(TM) A770 Graphics
    Driver version  : 24.09.28717.17 (Linux x64)
    Compute units   : 512
    Clock frequency : 2400 MHz
DeviceBinaryFormat::zebin : Unhandled SHT_NOTE section : .note.intelgt.metrics currently supports only : .note.intelgt.compat.
DeviceBinaryFormat::zebin::.ze_info : Minor version : 40 is newer than available in decoder : 39 - some features may be skipped


    Global memory bandwidth (GBPS)

Waiting for task count 0 at location 0x7fb1fbd65000 with timeout 0. Current value: 0

Waiting completed. Current value: 0

Waiting for task count 1 at location 0x7fb1fbd5f000 with timeout 0. Current value: 0

Waiting completed. Current value: 1
computeUnits for each thread: 4096
perHwThreadPrivateMemorySize: 0  totalPrivateMemorySize: 0
perHwThreadScratchSize: 0        totalScratchSize: 0
perHwThreadPrivateScratchSize: 0         totalPrivateScratchSize: 0
computeUnits for each thread: 4096
perHwThreadPrivateMemorySize: 0  totalPrivateMemorySize: 0
perHwThreadScratchSize: 0        totalScratchSize: 0
perHwThreadPrivateScratchSize: 0         totalPrivateScratchSize: 0
computeUnits for each thread: 4096
perHwThreadPrivateMemorySize: 0  totalPrivateMemorySize: 0
perHwThreadScratchSize: 0        totalScratchSize: 0
perHwThreadPrivateScratchSize: 0         totalPrivateScratchSize: 0
computeUnits for each thread: 4096
perHwThreadPrivateMemorySize: 0  totalPrivateMemorySize: 0
perHwThreadScratchSize: 0        totalScratchSize: 0
perHwThreadPrivateScratchSize: 0         totalPrivateScratchSize: 0
computeUnits for each thread: 4096
perHwThreadPrivateMemorySize: 0  totalPrivateMemorySize: 0
perHwThreadScratchSize: 0        totalScratchSize: 0
perHwThreadPrivateScratchSize: 0         totalPrivateScratchSize: 0
computeUnits for each thread: 4096
perHwThreadPrivateMemorySize: 0  totalPrivateMemorySize: 0
perHwThreadScratchSize: 0        totalScratchSize: 0
perHwThreadPrivateScratchSize: 0         totalPrivateScratchSize: 0
computeUnits for each thread: 4096
perHwThreadPrivateMemorySize: 0  totalPrivateMemorySize: 0
perHwThreadScratchSize: 0        totalScratchSize: 0
perHwThreadPrivateScratchSize: 0         totalPrivateScratchSize: 0
computeUnits for each thread: 4096
perHwThreadPrivateMemorySize: 0  totalPrivateMemorySize: 0
perHwThreadScratchSize: 0        totalScratchSize: 0
perHwThreadPrivateScratchSize: 0         totalPrivateScratchSize: 0
computeUnits for each thread: 4096
perHwThreadPrivateMemorySize: 0  totalPrivateMemorySize: 0
perHwThreadScratchSize: 0        totalScratchSize: 0
perHwThreadPrivateScratchSize: 0         totalPrivateScratchSize: 0
computeUnits for each thread: 4096
perHwThreadPrivateMemorySize: 0  totalPrivateMemorySize: 0
perHwThreadScratchSize: 0        totalScratchSize: 0
perHwThreadPrivateScratchSize: 0         totalPrivateScratchSize: 0
      float   : DIM:1   GWS:(33550336, 1, 1)    ELWS:(256, 1, 1)        Offset:(0, 0, 0)        AGWS:(33550336, 1, 1) LWS:(256, 1, 1) TWGS:(131056, 1, 1)     NWGS:(131056, 1, 1)     SWGS:(0, 0, 0)
devMode = 3, taskMode = 3.
devMode = 3, taskMode = 3.
preemption = 3.
DIM:1   GWS:(33550336, 1, 1)    ELWS:(256, 1, 1)        Offset:(0, 0, 0)        AGWS:(33550336, 1, 1)   LWS:(256, 1, 1)       TWGS:(131056, 1, 1)     NWGS:(131056, 1, 1)     SWGS:(0, 0, 0)
devMode = 3, taskMode = 3.
devMode = 3, taskMode = 3.
preemption = 3.

Waiting for task count 2 at location 0x7fb1fbd65000 with timeout 96. Current value: 0

Waiting completed. Current value: 0

Waiting for task count 2 at location 0x7fb1fbd65000 with timeout 0. Current value: 0

Then it stuck here and clpeak process consumes one cpu core (100% usage).

perf record -a when it stuck reports:

Samples: 537K of event 'cycles:P', Event count (approx.): 456413846417
Overhead       Samples  Command          Shared Object                                   Symbol
   9.08%         36420  clpeak           [kernel.kallsyms]                               [k] clear_bhb_loop                                                                                                                                ◆
   5.43%         21754  clpeak           [kernel.kallsyms]                               [k] __schedule                                                                                                                                    ▒
   4.73%         18984  clpeak           libc.so.6                                       [.] __sched_yield                                                                                                                                 ▒
   4.65%         18636  clpeak           [kernel.kallsyms]                               [k] _raw_spin_lock                                                                                                                                ▒
   4.64%         18584  clpeak           [vdso]                                          [.] __vdso_clock_gettime                                                                                                                          ▒
   4.61%         18482  clpeak           [kernel.kallsyms]                               [k] native_sched_clock                                                                                                                            ▒
   4.48%         17981  clpeak           [kernel.kallsyms]                               [k] psi_account_irqtime                                                                                                                           ▒
   4.01%         16069  clpeak           [kernel.kallsyms]                               [k] update_curr                                                                                                                                   ▒
   3.71%         14894  clpeak           [kernel.kallsyms]                               [k] syscall_exit_to_user_mode                                                                                                                     ▒
   3.69%         14808  clpeak           [kernel.kallsyms]                               [k] __calc_delta.constprop.0                                                                                                                      ▒
   3.66%         14687  clpeak           [kernel.kallsyms]                               [k] entry_SYSRETQ_unsafe_stack                                                                                                                    ▒
   3.50%         14043  clpeak           [kernel.kallsyms]                               [k] pick_next_task_fair                                                                                                                           ▒
   3.23%         12965  clpeak           libigdrcl.so                                    [.] 0x000000000005f724                                                                                                                            ▒
   2.76%         11064  clpeak           [kernel.kallsyms]                               [k] pick_eevdf                                                                                                                                    ▒
   2.61%         10453  clpeak           [kernel.kallsyms]                               [k] do_syscall_64                                                                                                                                 ▒
   2.40%          9623  clpeak           [kernel.kallsyms]                               [k] entry_SYSCALL_64                                                                                                                              ▒
   2.25%          9021  clpeak           [kernel.kallsyms]                               [k] update_min_vruntime                                                                                                                           ▒
   1.83%          7352  clpeak           [kernel.kallsyms]                               [k] update_curr_se                                                                                                                                ▒
   1.78%          7126  clpeak           [kernel.kallsyms]                               [k] entry_SYSCALL_64_after_hwframe                                                                                                                ▒
   1.56%          6269  clpeak           [kernel.kallsyms]                               [k] __cgroup_account_cputime                                                                                                                      ▒
   1.52%          6079  clpeak           [kernel.kallsyms]                               [k] record_times                                                                                                                                  ▒
   1.51%          6064  clpeak           [kernel.kallsyms]                               [k] update_rq_clock                                                                                                                               ▒
   1.49%          5981  clpeak           [kernel.kallsyms]                               [k] do_sched_yield                                                                                                                                ▒
   1.11%          4453  clpeak           [kernel.kallsyms]                               [k] rcu_note_context_switch                                                                                                                       ▒
   1.01%          4037  clpeak           libigdrcl.so                                    [.] 0x000000000005f726                                                                                                                            ▒
   0.91%          3651  clpeak           [kernel.kallsyms]                               [k] yield_task_fair                                                                                                                               ▒
   0.83%          3346  clpeak           [kernel.kallsyms]                               [k] entry_SYSCALL_64_safe_stack                                                                                                                   ▒
   0.82%          3285  clpeak           [kernel.kallsyms]                               [k] raw_spin_rq_lock_nested                                                                                                                       ▒
   0.78%          3113  clpeak           [kernel.kallsyms]                               [k] syscall_return_via_sysret                                                                                                                     ▒
   0.75%          3008  clpeak           [kernel.kallsyms]                               [k] schedule                                                                                                                                      ▒
   0.72%         64724  swapper          [kernel.kallsyms]                               [k] intel_idle                                                                                                                                    ▒
   0.70%          2792  clpeak           libc.so.6                                       [.] clock_gettime@@GLIBC_2.17                                                                                                                     ▒
   0.58%          2338  clpeak           [kernel.kallsyms]                               [k] cpuacct_charge                                                                                                                                ▒
   0.53%          2107  clpeak           [kernel.kallsyms]                               [k] sched_clock                                                                                                                                   ▒
   0.51%          2060  clpeak           [kernel.kallsyms]                               [k] sched_clock_cpu                                                                                                                               ▒
   0.45%          1805  clpeak           libstdc++.so.6.0.33                             [.] std::chrono::_V2::system_clock::now()                                                                                                         ▒
   0.42%          1690  clpeak           [kernel.kallsyms]                               [k] _raw_spin_unlock                                                                                                                              ▒
   0.39%          1567  clpeak           libigdrcl.so                                    [.] 0x0000000000534467                                                                                                                            ▒
   0.39%          1563  clpeak           [kernel.kallsyms]                               [k] x64_sys_call                                                                                                                                  ▒
   0.37%          1469  clpeak           [kernel.kallsyms]                               [k] __list_add_valid_or_report                                                                                                                    ▒
   0.35%          1418  clpeak           [kernel.kallsyms]                               [k] check_cfs_rq_runtime                                                                                                                          ▒
   0.35%          1402  clpeak           [kernel.kallsyms]                               [k] pick_next_entity                                                                                                                              ▒
   0.34%          1368  clpeak           [kernel.kallsyms]                               [k] cgroup_rstat_updated                                                                                                                          ▒
   0.34%          1364  clpeak           [kernel.kallsyms]                               [k] __list_del_entry_valid_or_report                                                                                                              ▒
   0.27%          1098  clpeak           [kernel.kallsyms]                               [k] syscall_exit_to_user_mode_prepare                                                                                                             ▒
   0.23%          1856  pgrep            libc.so.6                                       [.] __strncpy_evex                                                                                                                                ▒
   0.22%           882  clpeak           libigdrcl.so                                    [.] 0x000000000056e33e                                                                                                                            ▒
   0.19%           775  clpeak           [kernel.kallsyms]                               [k] __x64_sys_sched_yield                 

System information:

kernel 6.8.5-301.fc40.x86_64
Fedora Kinoite 40.20240419.n.0
intel-compute-runtime 24.09.28717.17-1.fc40.x86_64 (this also happens on latest 24.13.29138.7, installed in a ubuntu 22.04 container)

CPU: Intel Core i9-10940X
GPU: Intel Arc A770 16GB

There's a i915 kernel driver regression on CCS (Compute Command Streamer) engine of DG2/Arc GPU.

https://gitlab.freedesktop.org/drm/intel/-/issues/10895

The faulty kernel commit in 6.6.26+ and 6.8.5+ is https://lore.kernel.org/stable/20240327155622.538140-4-andi.shyti@linux.intel.com/T/ And both Media-Driver and OpenCL are affected.

@JablonskiMateusz Could you please let the kernel developer aware of this regression?

This one seems to be happening with any kernel that has Spectre BHI fix in it.
(Not sure if it is related or not and disabling it doesn't change anything.)

Happens with LTS kernels as well.
Happens with compute runtime 24.13 as well.

Kernels with this issue:

  • 6.8.5 or newer
  • 6.6.26 LTS or newer

Kernels that works fine:

  • 6.8.4 or older
  • 6.6.25 LTS or older

Haven't tried kernels older than 6.6 LTS.

OS: Arch Linux
Compute Runtime: aur/intel-compute-runtime-bin 24.13.29138.7-1

CPU: AMD Ryzen 7 5800X3D
GPU: Intel ARC A770 16GB

To determine if the same issue caused this regression, you can try my custom Arch Linux kernel package with the CCS changes reverted: https://github.com/gnattu/linux/releases/tag/6.8.7-jelly

If this kernel fixes the issue, then it's likely related to the CCS change as well.

To determine if the same issue caused this regression, you can try my custom Arch Linux kernel package with the CCS changes reverted: https://github.com/gnattu/linux/releases/tag/6.8.7-jelly

If this kernel fixes the issue, then it's likely related to the CCS change as well.

That custom kernel works fine. So it is CCS related.
Spectre BHI is just coincidence i guess.

An Intel developer has provided me with a series of patches to test the theory of a potential fix without a full revert. I have created an Arch Linux package for it, with the related patches attached:

https://github.com/gnattu/linux/releases/tag/6.8.7-intel-ccs_mode-4

Please be aware that this kernel is not a guaranteed fix and is only for testing purposes. If anyone has the time, please try this kernel and let me know if it fixes the issue or not.

An Intel developer has provided me with a series of patches to test the theory of a potential fix without a full revert. I have created an Arch Linux package for it, with the related patches attached:

https://github.com/gnattu/linux/releases/tag/6.8.7-intel-ccs_mode-4

Please be aware that this kernel is not a guaranteed fix and is only for testing purposes. If anyone has the time, please try this kernel and let me know if it fixes the issue or not.

This one doesn't work. But it has "gray / io-wait" CPU usage instead of "red / kernel" CPU usage on htop.

Just got the latest patch from Intel developers:

https://github.com/gnattu/linux/releases/tag/6.8.7-intel-set-ccs-mode-on-reset

Intel developers already reproduced our issue with clpeak, so this kernel has a higher chance of fixing our issue.

Just got the latest patch from Intel developers:

https://github.com/gnattu/linux/releases/tag/6.8.7-intel-set-ccs-mode-on-reset

Intel developers already reproduced our issue with clpeak, so this kernel has a higher chance of fixing our issue.

This works but performance is 1/4 of what it should be.

I can confirm that the applications no longer hang and got fixed with a kernel built with the new patches on drm-intel-gt-next but compute performance seems to be lowered on my system as well. Not sure if there is something going on with the compute runtime and these new changes or if it is out of scope for this discussion here.

Just got the latest patch from Intel developers:
https://github.com/gnattu/linux/releases/tag/6.8.7-intel-set-ccs-mode-on-reset
Intel developers already reproduced our issue with clpeak, so this kernel has a higher chance of fixing our issue.

This works but performance is 1/4 of what it should be.

Can you try this kernel build to see if the performance improves?

https://github.com/gnattu/linux/releases/tag/6.9.1-jelly

Just got the latest patch from Intel developers:
https://github.com/gnattu/linux/releases/tag/6.8.7-intel-set-ccs-mode-on-reset
Intel developers already reproduced our issue with clpeak, so this kernel has a higher chance of fixing our issue.

This works but performance is 1/4 of what it should be.

Can you try this kernel build to see if the performance improves?

https://github.com/gnattu/linux/releases/tag/6.9.1-jelly

This works fine and performance is pretty much the same as 6.8.4.
Note: I am using a PCI-E 4.0 x4 port with a PCI-E riser so can't really say much about memcpy.

6.8.4 / 6.9.1-custom:

image

Just got the latest patch from Intel developers:
https://github.com/gnattu/linux/releases/tag/6.8.7-intel-set-ccs-mode-on-reset
Intel developers already reproduced our issue with clpeak, so this kernel has a higher chance of fixing our issue.

This works but performance is 1/4 of what it should be.

Can you try this kernel build to see if the performance improves?

https://github.com/gnattu/linux/releases/tag/6.9.1-jelly

Can confirm that this is an improvement. I don't know how to run a benchmark to check memcopy but just comparing to a pretty intense blender file I have lying around, there's a noticeable improvement in the viewport

Now, with kernel 6.8.10 in fedora 40, it works, though performance is not great. I'm closing this. Thanks for everyone!

Now, with kernel 6.8.10 in fedora 40, it works, though performance is not great. I'm closing this. Thanks for everyone!

@notsyncing Full performance needs the patch mentioned above, drm/i915/gt: Fix CCS id's calculation for CCS mode setting but it has only been merged in kernel 6.10-rc2 so it will take a while to get to Fedora.

I'm on kernel 6.9 and performance is back up to pre 6.8 levels. Would love even more performance, of course, but I'll be surprised if there is another bump after 6.10.

I don't see any change with kernel 6.10 on OpenCL-Benchmark, FWIW.