clpeak and llama.cpp stuck at 100% CPU on 6.8.5 kernel
notsyncing opened this issue · 16 comments
New 6.8.5, 6.8.6 and 6.6.27 LTS kernels are unable to run using the GPU.
It detects and tries to run on the GPU but gets stuck with 100% single CPU core usage.
Happens on any OpenCL or SYCL app. (Kernel 6.8 is using the workaround provided in this thread.)You can downgrade to Linux 6.8.4 for Arch Linux with these packages:
linux 6.8.4: https://archive.archlinux.org/packages/l/linux/linux-6.8.4.arch1-1-x86_64.pkg.tar.zst
linux-headers 6.8.4: https://archive.archlinux.org/packages/l/linux-headers/linux-headers-6.8.4.arch1-1-x86_64.pkg.tar.zst
This happens to me as well on both llama.cpp and clpeak.
clpeak
output:
$ NEOReadDebugKeys=1 PrintDebugMessages=1 LogWaitingForCompletion=1 EventsDebugEnable=1 PrintKmdTimes=1 LogZEInfo=1 clpeak
WARNING: Failed to request OCL Turbo Boost
computeUnitsUsedForScratch: 4096
hwInfo: {512, 4096}: (16, 1, 32)
Platform: Intel(R) OpenCL Graphics
Device: Intel(R) Arc(TM) A770 Graphics
Driver version : 24.09.28717.17 (Linux x64)
Compute units : 512
Clock frequency : 2400 MHz
DeviceBinaryFormat::zebin : Unhandled SHT_NOTE section : .note.intelgt.metrics currently supports only : .note.intelgt.compat.
DeviceBinaryFormat::zebin::.ze_info : Minor version : 40 is newer than available in decoder : 39 - some features may be skipped
Global memory bandwidth (GBPS)
Waiting for task count 0 at location 0x7fb1fbd65000 with timeout 0. Current value: 0
Waiting completed. Current value: 0
Waiting for task count 1 at location 0x7fb1fbd5f000 with timeout 0. Current value: 0
Waiting completed. Current value: 1
computeUnits for each thread: 4096
perHwThreadPrivateMemorySize: 0 totalPrivateMemorySize: 0
perHwThreadScratchSize: 0 totalScratchSize: 0
perHwThreadPrivateScratchSize: 0 totalPrivateScratchSize: 0
computeUnits for each thread: 4096
perHwThreadPrivateMemorySize: 0 totalPrivateMemorySize: 0
perHwThreadScratchSize: 0 totalScratchSize: 0
perHwThreadPrivateScratchSize: 0 totalPrivateScratchSize: 0
computeUnits for each thread: 4096
perHwThreadPrivateMemorySize: 0 totalPrivateMemorySize: 0
perHwThreadScratchSize: 0 totalScratchSize: 0
perHwThreadPrivateScratchSize: 0 totalPrivateScratchSize: 0
computeUnits for each thread: 4096
perHwThreadPrivateMemorySize: 0 totalPrivateMemorySize: 0
perHwThreadScratchSize: 0 totalScratchSize: 0
perHwThreadPrivateScratchSize: 0 totalPrivateScratchSize: 0
computeUnits for each thread: 4096
perHwThreadPrivateMemorySize: 0 totalPrivateMemorySize: 0
perHwThreadScratchSize: 0 totalScratchSize: 0
perHwThreadPrivateScratchSize: 0 totalPrivateScratchSize: 0
computeUnits for each thread: 4096
perHwThreadPrivateMemorySize: 0 totalPrivateMemorySize: 0
perHwThreadScratchSize: 0 totalScratchSize: 0
perHwThreadPrivateScratchSize: 0 totalPrivateScratchSize: 0
computeUnits for each thread: 4096
perHwThreadPrivateMemorySize: 0 totalPrivateMemorySize: 0
perHwThreadScratchSize: 0 totalScratchSize: 0
perHwThreadPrivateScratchSize: 0 totalPrivateScratchSize: 0
computeUnits for each thread: 4096
perHwThreadPrivateMemorySize: 0 totalPrivateMemorySize: 0
perHwThreadScratchSize: 0 totalScratchSize: 0
perHwThreadPrivateScratchSize: 0 totalPrivateScratchSize: 0
computeUnits for each thread: 4096
perHwThreadPrivateMemorySize: 0 totalPrivateMemorySize: 0
perHwThreadScratchSize: 0 totalScratchSize: 0
perHwThreadPrivateScratchSize: 0 totalPrivateScratchSize: 0
computeUnits for each thread: 4096
perHwThreadPrivateMemorySize: 0 totalPrivateMemorySize: 0
perHwThreadScratchSize: 0 totalScratchSize: 0
perHwThreadPrivateScratchSize: 0 totalPrivateScratchSize: 0
float : DIM:1 GWS:(33550336, 1, 1) ELWS:(256, 1, 1) Offset:(0, 0, 0) AGWS:(33550336, 1, 1) LWS:(256, 1, 1) TWGS:(131056, 1, 1) NWGS:(131056, 1, 1) SWGS:(0, 0, 0)
devMode = 3, taskMode = 3.
devMode = 3, taskMode = 3.
preemption = 3.
DIM:1 GWS:(33550336, 1, 1) ELWS:(256, 1, 1) Offset:(0, 0, 0) AGWS:(33550336, 1, 1) LWS:(256, 1, 1) TWGS:(131056, 1, 1) NWGS:(131056, 1, 1) SWGS:(0, 0, 0)
devMode = 3, taskMode = 3.
devMode = 3, taskMode = 3.
preemption = 3.
Waiting for task count 2 at location 0x7fb1fbd65000 with timeout 96. Current value: 0
Waiting completed. Current value: 0
Waiting for task count 2 at location 0x7fb1fbd65000 with timeout 0. Current value: 0
Then it stuck here and clpeak
process consumes one cpu core (100% usage).
perf record -a
when it stuck reports:
Samples: 537K of event 'cycles:P', Event count (approx.): 456413846417
Overhead Samples Command Shared Object Symbol
9.08% 36420 clpeak [kernel.kallsyms] [k] clear_bhb_loop ◆
5.43% 21754 clpeak [kernel.kallsyms] [k] __schedule ▒
4.73% 18984 clpeak libc.so.6 [.] __sched_yield ▒
4.65% 18636 clpeak [kernel.kallsyms] [k] _raw_spin_lock ▒
4.64% 18584 clpeak [vdso] [.] __vdso_clock_gettime ▒
4.61% 18482 clpeak [kernel.kallsyms] [k] native_sched_clock ▒
4.48% 17981 clpeak [kernel.kallsyms] [k] psi_account_irqtime ▒
4.01% 16069 clpeak [kernel.kallsyms] [k] update_curr ▒
3.71% 14894 clpeak [kernel.kallsyms] [k] syscall_exit_to_user_mode ▒
3.69% 14808 clpeak [kernel.kallsyms] [k] __calc_delta.constprop.0 ▒
3.66% 14687 clpeak [kernel.kallsyms] [k] entry_SYSRETQ_unsafe_stack ▒
3.50% 14043 clpeak [kernel.kallsyms] [k] pick_next_task_fair ▒
3.23% 12965 clpeak libigdrcl.so [.] 0x000000000005f724 ▒
2.76% 11064 clpeak [kernel.kallsyms] [k] pick_eevdf ▒
2.61% 10453 clpeak [kernel.kallsyms] [k] do_syscall_64 ▒
2.40% 9623 clpeak [kernel.kallsyms] [k] entry_SYSCALL_64 ▒
2.25% 9021 clpeak [kernel.kallsyms] [k] update_min_vruntime ▒
1.83% 7352 clpeak [kernel.kallsyms] [k] update_curr_se ▒
1.78% 7126 clpeak [kernel.kallsyms] [k] entry_SYSCALL_64_after_hwframe ▒
1.56% 6269 clpeak [kernel.kallsyms] [k] __cgroup_account_cputime ▒
1.52% 6079 clpeak [kernel.kallsyms] [k] record_times ▒
1.51% 6064 clpeak [kernel.kallsyms] [k] update_rq_clock ▒
1.49% 5981 clpeak [kernel.kallsyms] [k] do_sched_yield ▒
1.11% 4453 clpeak [kernel.kallsyms] [k] rcu_note_context_switch ▒
1.01% 4037 clpeak libigdrcl.so [.] 0x000000000005f726 ▒
0.91% 3651 clpeak [kernel.kallsyms] [k] yield_task_fair ▒
0.83% 3346 clpeak [kernel.kallsyms] [k] entry_SYSCALL_64_safe_stack ▒
0.82% 3285 clpeak [kernel.kallsyms] [k] raw_spin_rq_lock_nested ▒
0.78% 3113 clpeak [kernel.kallsyms] [k] syscall_return_via_sysret ▒
0.75% 3008 clpeak [kernel.kallsyms] [k] schedule ▒
0.72% 64724 swapper [kernel.kallsyms] [k] intel_idle ▒
0.70% 2792 clpeak libc.so.6 [.] clock_gettime@@GLIBC_2.17 ▒
0.58% 2338 clpeak [kernel.kallsyms] [k] cpuacct_charge ▒
0.53% 2107 clpeak [kernel.kallsyms] [k] sched_clock ▒
0.51% 2060 clpeak [kernel.kallsyms] [k] sched_clock_cpu ▒
0.45% 1805 clpeak libstdc++.so.6.0.33 [.] std::chrono::_V2::system_clock::now() ▒
0.42% 1690 clpeak [kernel.kallsyms] [k] _raw_spin_unlock ▒
0.39% 1567 clpeak libigdrcl.so [.] 0x0000000000534467 ▒
0.39% 1563 clpeak [kernel.kallsyms] [k] x64_sys_call ▒
0.37% 1469 clpeak [kernel.kallsyms] [k] __list_add_valid_or_report ▒
0.35% 1418 clpeak [kernel.kallsyms] [k] check_cfs_rq_runtime ▒
0.35% 1402 clpeak [kernel.kallsyms] [k] pick_next_entity ▒
0.34% 1368 clpeak [kernel.kallsyms] [k] cgroup_rstat_updated ▒
0.34% 1364 clpeak [kernel.kallsyms] [k] __list_del_entry_valid_or_report ▒
0.27% 1098 clpeak [kernel.kallsyms] [k] syscall_exit_to_user_mode_prepare ▒
0.23% 1856 pgrep libc.so.6 [.] __strncpy_evex ▒
0.22% 882 clpeak libigdrcl.so [.] 0x000000000056e33e ▒
0.19% 775 clpeak [kernel.kallsyms] [k] __x64_sys_sched_yield
System information:
kernel 6.8.5-301.fc40.x86_64
Fedora Kinoite 40.20240419.n.0
intel-compute-runtime 24.09.28717.17-1.fc40.x86_64 (this also happens on latest 24.13.29138.7, installed in a ubuntu 22.04 container)
CPU: Intel Core i9-10940X
GPU: Intel Arc A770 16GB
There's a i915 kernel driver regression on CCS (Compute Command Streamer) engine of DG2/Arc GPU.
https://gitlab.freedesktop.org/drm/intel/-/issues/10895
The faulty kernel commit in 6.6.26+ and 6.8.5+ is https://lore.kernel.org/stable/20240327155622.538140-4-andi.shyti@linux.intel.com/T/ And both Media-Driver and OpenCL are affected.
@JablonskiMateusz Could you please let the kernel developer aware of this regression?
This one seems to be happening with any kernel that has Spectre BHI fix in it.
(Not sure if it is related or not and disabling it doesn't change anything.)
Happens with LTS kernels as well.
Happens with compute runtime 24.13 as well.
Kernels with this issue:
- 6.8.5 or newer
- 6.6.26 LTS or newer
Kernels that works fine:
- 6.8.4 or older
- 6.6.25 LTS or older
Haven't tried kernels older than 6.6 LTS.
OS: Arch Linux
Compute Runtime: aur/intel-compute-runtime-bin 24.13.29138.7-1
CPU: AMD Ryzen 7 5800X3D
GPU: Intel ARC A770 16GB
To determine if the same issue caused this regression, you can try my custom Arch Linux kernel package with the CCS changes reverted: https://github.com/gnattu/linux/releases/tag/6.8.7-jelly
If this kernel fixes the issue, then it's likely related to the CCS change as well.
To determine if the same issue caused this regression, you can try my custom Arch Linux kernel package with the CCS changes reverted: https://github.com/gnattu/linux/releases/tag/6.8.7-jelly
If this kernel fixes the issue, then it's likely related to the CCS change as well.
That custom kernel works fine. So it is CCS related.
Spectre BHI is just coincidence i guess.
An Intel developer has provided me with a series of patches to test the theory of a potential fix without a full revert. I have created an Arch Linux package for it, with the related patches attached:
https://github.com/gnattu/linux/releases/tag/6.8.7-intel-ccs_mode-4
Please be aware that this kernel is not a guaranteed fix and is only for testing purposes. If anyone has the time, please try this kernel and let me know if it fixes the issue or not.
An Intel developer has provided me with a series of patches to test the theory of a potential fix without a full revert. I have created an Arch Linux package for it, with the related patches attached:
https://github.com/gnattu/linux/releases/tag/6.8.7-intel-ccs_mode-4
Please be aware that this kernel is not a guaranteed fix and is only for testing purposes. If anyone has the time, please try this kernel and let me know if it fixes the issue or not.
This one doesn't work. But it has "gray / io-wait" CPU usage instead of "red / kernel" CPU usage on htop.
Just got the latest patch from Intel developers:
https://github.com/gnattu/linux/releases/tag/6.8.7-intel-set-ccs-mode-on-reset
Intel developers already reproduced our issue with clpeak, so this kernel has a higher chance of fixing our issue.
Just got the latest patch from Intel developers:
https://github.com/gnattu/linux/releases/tag/6.8.7-intel-set-ccs-mode-on-reset
Intel developers already reproduced our issue with clpeak, so this kernel has a higher chance of fixing our issue.
This works but performance is 1/4 of what it should be.
I can confirm that the applications no longer hang and got fixed with a kernel built with the new patches on drm-intel-gt-next but compute performance seems to be lowered on my system as well. Not sure if there is something going on with the compute runtime and these new changes or if it is out of scope for this discussion here.
Just got the latest patch from Intel developers:
https://github.com/gnattu/linux/releases/tag/6.8.7-intel-set-ccs-mode-on-reset
Intel developers already reproduced our issue with clpeak, so this kernel has a higher chance of fixing our issue.This works but performance is 1/4 of what it should be.
Can you try this kernel build to see if the performance improves?
Just got the latest patch from Intel developers:
https://github.com/gnattu/linux/releases/tag/6.8.7-intel-set-ccs-mode-on-reset
Intel developers already reproduced our issue with clpeak, so this kernel has a higher chance of fixing our issue.This works but performance is 1/4 of what it should be.
Can you try this kernel build to see if the performance improves?
This works fine and performance is pretty much the same as 6.8.4.
Note: I am using a PCI-E 4.0 x4 port with a PCI-E riser so can't really say much about memcpy.
6.8.4 / 6.9.1-custom:
Just got the latest patch from Intel developers:
https://github.com/gnattu/linux/releases/tag/6.8.7-intel-set-ccs-mode-on-reset
Intel developers already reproduced our issue with clpeak, so this kernel has a higher chance of fixing our issue.This works but performance is 1/4 of what it should be.
Can you try this kernel build to see if the performance improves?
Can confirm that this is an improvement. I don't know how to run a benchmark to check memcopy but just comparing to a pretty intense blender file I have lying around, there's a noticeable improvement in the viewport
Now, with kernel 6.8.10 in fedora 40, it works, though performance is not great. I'm closing this. Thanks for everyone!
Now, with kernel 6.8.10 in fedora 40, it works, though performance is not great. I'm closing this. Thanks for everyone!
@notsyncing Full performance needs the patch mentioned above, drm/i915/gt: Fix CCS id's calculation for CCS mode setting but it has only been merged in kernel 6.10-rc2 so it will take a while to get to Fedora.
I'm on kernel 6.9 and performance is back up to pre 6.8 levels. Would love even more performance, of course, but I'll be surprised if there is another bump after 6.10.
I don't see any change with kernel 6.10 on OpenCL-Benchmark, FWIW.