lkrg-org/lkrg

High CPU load on Debian 12 VM caused by LKRG


gnd commented

Hello,

we recently upgraded some of our VMs to Debian 12. They are used to run php8.2 for some web apps. However, as soon as we recompiled LKRG against the new kernel and started it, we noticed CPU usage reaching 100% very fast. This leads to machine lockups, and the apps become slow and unresponsive.

We tried many things, but it seems like LKRG is the issue. Once it is started, the load reaches 100% very fast; once it is turned off, the load falls back to normal within a minute. We run LKRG on dozens of machines, but only the ones running Debian 12 AND php have this issue. Older Debian machines with PHP and LKRG have no problem, and neither do machines that do not run PHP workloads.

We tried fiddling with the module's parameters, e.g. setting lkrg.profile_validate via sysctl all the way down to 0, but this didn't help.
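For reference, this is roughly what we ran (from memory, so the exact invocation may have differed slightly):

# sysctl lkrg.profile_validate=0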

We also tried checking out older LKRG releases and running them - with the same result (specifically it was 7db7483). In the current state we can't run LKRG, even though we would like to have it :(

Do you have any idea what might be wrong, or how we could help you debug this issue? Thanks!

Attached is a screenshot from Grafana, showing the effect of enabling/disabling LKRG 3 times in a row.

[Grafana screenshot: lkrg_load]

Thank you for reporting this @gnd. My two main guesses as to what could be causing this are:

  1. Too frequent kernel integrity verification, which LKRG by default performs not only periodically, but also on "random events". However, if you did in fact try lowering lkrg.profile_validate all the way to 0 and that didn't help, this guess is ruled out. You may want to double-check, though, by setting lkrg.kint_validate to a lower value (it should be sufficient to lower it from 3 to 2, but you can also try 0) - see the example commands right after this list.

  2. Too frequent updates of the kernel's code. The kernel uses self-modifying code for so-called "jump labels", and LKRG keeps track of that. In fact, currently LKRG does so even when lkrg.kint_validate is 0, so that you'd be able to switch from 0 to non-0 later. Maybe we need to add a mode where such tracking is also disabled, or just disable it at 0 and either not allow switching to non-0 without an LKRG reload or perform a hash recalculation when switching from 0 to non-0. Maybe we also need to add a way to update hashes to reflect a "jump label" change quickly, without full recalculation, although for that we'd have to use weaker hashing or a large number of hashes (e.g., one hash per 4 KiB).
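For guess 1, it should be enough to adjust the sysctl at runtime, e.g. (as root; check the current value first):

# sysctl lkrg.kint_validate
# sysctl lkrg.kint_validate=2
# sysctl lkrg.kint_validate=0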

Per your analysis so far, this is more likely issue 2 above.

It's puzzling that PHP causes this. It's also puzzling that a "jump label" would presumably be switching back and forth - normally, these are only switched once or very infrequently (on changes to kernel runtime configuration via sysctl or such). This could indicate a minor kernel bug, where what was meant to be an optimization ended up the other way around, since even without LKRG, updating the kernel code has some performance cost.
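If you're able to install perf on one of the affected VMs, a short system-wide profile captured during a spike would show which kernel code paths are burning the CPU. This is just a suggestion - the package providing perf varies between distributions:

# perf record -a -g -- sleep 10
# perf report --stdio | head -50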

Is it possible to see the list of all processes while you have such a spike of CPU usage? If the problem is related to JUMP_LABEL, we should see spikes related to kernel worker threads.
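For example, a snapshot like this, taken while the load is high, would do (this is just one way to grab it):

# top -b -n 1 -o %CPU | head -30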

gnd commented

Hello, unfortunately, if you mean the number of kworker processes, their number remained roughly the same. Here is a log:

# ps -ef|grep kworker|grep -v grep|wc -l
37
# systemctl start lkrg; sleep 240; ps -ef|grep kworker|grep -v grep|wc -l
39
# w
 10:28:33 up 5 days, 14:28,  3 users,  load average: 143.80, 74.00, 30.48
# systemctl stop lkrg

I think Adam meant not the number of those processes, but whether they're the ones actively running on CPU (e.g. per top) during the load spikes. Anyway, you show that the number of kworker processes is way lower than the load average, suggesting that there are many other processes in running state. It would be helpful to see the output of ps axo pid,pcpu,stat,time,wchan:30,comm k -pcpu during one of those load spikes.
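For example, during a spike (the head and redirection are just for convenience; ps_spike.txt is an arbitrary name):

# ps axo pid,pcpu,stat,time,wchan:30,comm k -pcpu | head -50 > ps_spike.txt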

gnd commented

Hello, attached are two files: one before enabling LKRG, and a second one after LKRG is enabled, when the load reached > 100.

ps_before.txt
ps_after.txt

Thanks @gnd. This is puzzling. We really need the WCHAN field to hopefully figure it out. I don't know why exactly it is empty for you, but perhaps you need to run ps with greater privileges?
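For example, something along these lines (assuming sudo/root is available; the /proc check is just a quick way to see whether wchan is exposed at all on your kernel):

# sudo ps axo pid,pcpu,stat,time,wchan:30,comm k -pcpu | head -50
# sudo cat /proc/1/wchan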

gnd commented

This might be because of some custom sysctl settings... let me check.
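For example, these are the hardening-related settings we'd check first (just guesses on our side as to which, if any, are relevant):

# sysctl kernel.kptr_restrict kernel.yama.ptrace_scope
# grep hidepid /proc/mounts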