rickysarraf/laptop-mode-tools

laptop mode consumes 100% CPU

Closed this issue · 16 comments

I've got laptop-mode-tools 1.72.2-3 installed via Arch's AUR, and I often notice the /usr/bin/laptop_mode command consuming 100% CPU indefinitely and it's somehow impervious to kill -9. Trying to start/stop it via systemctl stop laptop-mode also seems to have no effect.

My level of familiarity with the particulars of laptop-mode is fairly minimal other than having a laptop and wanting it to "do the right thing", but I'm willing to help troubleshoot in any way I can.

System info:
Lenovo ThinkPad T440p
4.17.5-1-ARCH #1 SMP PREEMPT Sun Jul 8 17:27:31 UTC 2018 x86_64 GNU/Linux

fam4r commented

I have the same problem.
It occurs when I unplug AC, 100% CPU, no way to spawn a terminal or everything else.

System info:
Lenovo Y50-70
4.17.8-1-ARCH #1 SMP PREEMPT Wed Jul 18 09:56:24 UTC 2018 x86_64 GNU/Linux

If log needed, please ask which one

I am yet to see such scenario locally on my machines. UnInterruptible Sleep may apply to processes under various catastrophic scenarios but none relate to LMT itself.

To know more, we need to dig. I definitely need full journal logs for the LMT service.

Does you machine go permanently in a hung state ? LIke this one: #123

Or is it just a sudden spike and then it settles down eventually?

My machine doesn't hang permanently, nor does it really settle down. One core is maxed at 100% and the fan cranks up, but it is otherwise mostly responsive. I'll post logs when it happens next.

Okay, I caught it happening again. I noticed CPU usage spike to 100% yesterday evening, but I'm not sure exactly when it happened. Eventually the laptop shut down when the battery died.

Here's the output of journalctl -u laptop-mode.service -b -1.

And here's the full output from journalctl -b -1.

I see what looks like the kernel repeatedly reporting that the laptop_mode task refuses to freeze after I closed the lid at Aug 05 21:29:43, but that was when I was done for the night and the CPU had already been pegged for at least an hour before that.

Nothing in the logs show anything related to Laptop Mode Tools.


Aug 05 21:35:44 laika NetworkManager[1120]: <info>  [1533522944.1240] device (wlp4s0): state change: unavailable -> unmanaged (reason 'sleeping', sys-iface-state: 'managed')
Aug 05 21:35:44 laika systemd-sleep[13528]: Failed to write /sys/power/state: Device or resource busy

The problem is your machine's swsusp functionality, which is broken. The sleep tasks are queued, which seem to cause for the spike in CPU

Thanks, I'll look into fixing my swsusp setup. I'm not exactly sure how these sleep tasks work, but is there a reason it would cause only /usr/bin/laptop_mode to spin at 100% in an un-killable state? Just trying to understand how these things interrelate here; I'd be happy to submit a PR to update the FAQ once I get it resolved.

The way swsusp works is to freeze all tasks and then lock the memory, and then depending upon the request; i.e. STD or STR, write to the disk or just sleep in the latter case.

Now when swsusp has issues, your machine can break at any of the above phases. Freezing tasks, locking memory, writing to disk, and even resuming/unfreezing tasks.

Why does laptop_mode end up showing as the common culprit ?
That's because the swsusp freeze and unfreeze tasks are system level, which generate kernel events that LMT acts upon. So when the event is generated, LMT acts on it. But the kernel, at that point of time, is already screwed up. It has a large queue of frozen tasks that it is trying to recover from.

Hence the reason why you are seeing a 100% CPU spike.

For your specific case, it is going to be a daunting task to find the actual culprit. Your best bet is to use a newer kernel and hope that this issue is not bound to any of the hardware on your machine.

It could also very well be that you have devices that advertise Runtime PM but actually aren't capable. Such (faulty) devices usually need to be blacklisted in LMT. See /etc/laptop-mode/conf.d/runtime-pm.conf

For context, you could also see the wide range of issues hitting the Debian Linux kernel recently.
904441: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=904441
904822: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=904822
905492: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=905492
905729: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=905729

Aug 06 01:22:35 laika kernel: PM: Syncing filesystems ... done.
Aug 06 01:22:35 laika kernel: Freezing user space processes ... 
Aug 06 01:22:35 laika kernel: IPv6: ADDRCONF(NETDEV_UP): enp0s25: link is not ready
Aug 06 01:22:35 laika kernel: Freezing of tasks failed after 20.003 seconds (1 tasks refusing to freeze, wq_busy=0):
Aug 06 01:22:35 laika kernel: laptop_mode     R  running task        0  9994   8424 0x00000004
Aug 06 01:22:35 laika kernel: Call Trace:
Aug 06 01:22:35 laika kernel:  ? __schedule+0x28a/0x890
Aug 06 01:22:35 laika kernel:  ? __schedule+0x28a/0x890
Aug 06 01:22:35 laika kernel:  ? preempt_schedule_irq+0x53/0x90
Aug 06 01:22:35 laika kernel:  ? reschedule_interrupt+0xa/0x20
Aug 06 01:22:35 laika kernel:  ? pci_restore_standard_config+0x40/0x40
Aug 06 01:22:35 laika kernel:  ? ioread32+0x2b/0x30
Aug 06 01:22:35 laika kernel:  ? nv04_timer_read+0x1d/0x60 [nouveau]
Aug 06 01:22:35 laika kernel:  ? nvkm_pmu_reset+0x62/0x160 [nouveau]
Aug 06 01:22:35 laika kernel:  ? nvkm_subdev_preinit+0x32/0x100 [nouveau]
Aug 06 01:22:35 laika kernel:  ? nvkm_device_init+0x5d/0x280 [nouveau]
Aug 06 01:22:35 laika kernel:  ? nvkm_udevice_init+0x41/0x60 [nouveau]
Aug 06 01:22:35 laika kernel:  ? nvkm_object_init+0x3e/0x100 [nouveau]
Aug 06 01:22:35 laika kernel:  ? nvkm_object_init+0x71/0x100 [nouveau]
Aug 06 01:22:35 laika kernel:  ? nvkm_object_init+0x71/0x100 [nouveau]
Aug 06 01:22:35 laika kernel:  ? nouveau_do_resume+0x28/0x150 [nouveau]
Aug 06 01:22:35 laika kernel:  ? nouveau_pmops_runtime_resume+0x88/0x150 [nouveau]
Aug 06 01:22:35 laika kernel:  ? pci_pm_runtime_resume+0x78/0xb0
Aug 06 01:22:35 laika kernel:  ? __rpm_callback+0x75/0x1f0
Aug 06 01:22:35 laika kernel:  ? pci_restore_standard_config+0x40/0x40
Aug 06 01:22:35 laika kernel:  ? rpm_callback+0x1f/0x70
Aug 06 01:22:35 laika kernel:  ? pci_restore_standard_config+0x40/0x40
Aug 06 01:22:35 laika kernel:  ? rpm_resume+0x5a2/0x7e0
Aug 06 01:22:35 laika kernel:  ? do_wp_page+0x8c/0x5b0
Aug 06 01:22:35 laika kernel:  ? pm_runtime_forbid+0x3e/0x50
Aug 06 01:22:35 laika kernel:  ? control_store+0x75/0x80
Aug 06 01:22:35 laika kernel:  ? kernfs_fop_write+0x116/0x190
Aug 06 01:22:35 laika kernel:  ? __vfs_write+0x36/0x170
Aug 06 01:22:35 laika kernel:  ? handle_mm_fault+0x10a/0x250
Aug 06 01:22:35 laika kernel:  ? vfs_write+0xa9/0x190
Aug 06 01:22:35 laika kernel:  ? ksys_write+0x4f/0xb0
Aug 06 01:22:35 laika kernel:  ? do_syscall_64+0x5b/0x170
Aug 06 01:22:35 laika kernel:  ? entry_SYSCALL_64_after_hwframe+0x44/0xa9
Aug 06 01:22:35 laika kernel: OOM killer enabled.

And what device is this exactly ? NVIDIA card with nouveau ? Does the issue still occur if you blacklist the nouveau driver ?

I have a laptop with Optimus (Intel and Nvidia) and as soon as my laptop is unplugged laptop mode tools loads nvidia modules and becomes unresponsive until reboot.
I use optimus-manager and bbswitch to handle Nvidia card power because nouveau module cant poweroff my card.

I have a laptop with Optimus (Intel and Nvidia) and as soon as my laptop is unplugged laptop mode tools loads nvidia modules and becomes unresponsive until reboot.
I use optimus-manager and bbswitch to handle Nvidia card power because nouveau module cant poweroff my card.

@juanSerbatic
There's nothing laptop mode tools does specific to nvidia, or nouveau for that matter. You must have something fiddling those settings around. The only thing laptop mode tools supports is vgaswitcheroo but that too is disabled by default.

@rickysarraf

I noticed it is runtime-pm module the one that powers on my discrete gpu and makes my laptop unresponsive. Disable It makes laptop mode tools work as expected. This is what happens when laptop is unplugged (if laptop mode tools is enabled with AC this is not happening):


abr 15 14:29:33 i7 root[15452]: ACPI action undefined: ACPI0003:00
abr 15 14:29:33 i7 systemd[1]: Reloading Laptop Mode Tools.
abr 15 14:29:33 i7 org_kde_powerdevil[1910]: powerdevil:
abr 15 14:29:33 i7 org_kde_powerdevil[1910]: powerdevil: Can't contact ck
abr 15 14:29:33 i7 org_kde_powerdevil[1910]: powerdevil: We are now into activity  "fe020fb0-8fbe-45bd-a59d-978d526f698f"
abr 15 14:29:33 i7 org_kde_powerdevil[1910]: powerdevil: () ()
abr 15 14:29:33 i7 org_kde_powerdevil[1910]: powerdevil: () ()
abr 15 14:29:33 i7 org_kde_powerdevil[1910]: powerdevil: Loading profile for unplugged AC
abr 15 14:29:33 i7 org_kde_powerdevil[1910]: powerdevil: Activity is not forcing a profile
abr 15 14:29:33 i7 org_kde_powerdevil[1910]: powerdevil:
abr 15 14:29:33 i7 org_kde_powerdevil[1910]: powerdevil: Loading timeouts with  120000
abr 15 14:29:33 i7 org_kde_powerdevil[1910]: org.kde.knotifications: Audio notification requested, but sound file from notifyrc file was not found, aborting audio notification
abr 15 14:29:33 i7 baloo_file[1884]: Power state changed
abr 15 14:29:33 i7 laptop_mode[15454]: Laptop mode
abr 15 14:29:33 i7 laptop-mode[15531]: enabled, active
abr 15 14:29:33 i7 laptop_mode[15454]: enabled, active
abr 15 14:29:33 i7 laptop_mode[15454]: command failed: Operation not supported (-95)
abr 15 14:29:33 i7 laptop-mode[16127]: Failed to set power saving mode for wireless card
abr 15 14:29:33 i7 laptop_mode[15454]: Failed to set power saving mode for wireless card
abr 15 14:29:39 i7 systemd[1]: Reloaded Laptop Mode Tools.
abr 15 14:29:39 i7 root[21603]: ACPI action undefined: PNP0C0A:00
abr 15 14:29:39 i7 root[21657]: ACPI action undefined: PNP0C0A:00
abr 15 14:29:39 i7 root[21711]: ACPI group/action undefined: processor / LNXCPU:00
abr 15 14:29:39 i7 root[21713]: ACPI group/action undefined: processor / LNXCPU:01
abr 15 14:29:39 i7 root[21715]: ACPI group/action undefined: processor / LNXCPU:02
abr 15 14:29:40 i7 root[21717]: ACPI group/action undefined: processor / LNXCPU:03
abr 15 14:29:40 i7 root[21719]: ACPI group/action undefined: processor / LNXCPU:04
abr 15 14:29:40 i7 root[21721]: ACPI group/action undefined: processor / LNXCPU:05
abr 15 14:29:40 i7 root[21723]: ACPI group/action undefined: processor / LNXCPU:06
abr 15 14:29:40 i7 root[21725]: ACPI group/action undefined: processor / LNXCPU:07
abr 15 14:29:40 i7 root[21727]: ACPI group/action undefined: thermal_zone / LNXTHERM:00
abr 15 14:29:47 i7 kernel: IPMI message handler: version 39.2
abr 15 14:29:47 i7 kernel: ipmi device interface
abr 15 14:29:47 i7 kernel: nvidia: module license 'NVIDIA' taints kernel.
abr 15 14:29:47 i7 kernel: Disabling lock debugging due to kernel taint
abr 15 14:29:47 i7 kernel: nvidia-nvlink: Nvlink Core is being initialized, major device number 238
abr 15 14:29:48 i7 kernel: nvidia 0000:01:00.0: enabling device (0000 -> 0003)
abr 15 14:29:48 i7 kernel: NVRM: The NVIDIA GPU 0000:01:00.0
                           NVRM: (PCI ID: 10de:1392) installed in this system has
                           NVRM: fallen off the bus and is not responding to commands.
abr 15 14:29:48 i7 kernel: nvidia: probe of 0000:01:00.0 failed with error -1
abr 15 14:29:48 i7 kernel: NVRM: The NVIDIA probe routine failed for 1 device(s).
abr 15 14:29:48 i7 kernel: NVRM: None of the NVIDIA graphics adapters were initialized!
abr 15 14:29:48 i7 kernel: nvidia-nvlink: Unregistered the Nvlink Core, major device number 238

These final kernel's messages are repeating until reboot.

Thanks @juanSerbatic
This clearly is uncovering a bug in the nvidia driver. You may want to report it to the nvidia folks if you care about their driver.

As for the hang issue, in your case, you can blacklist the device based on its device id. Which from your logs, should be: 0000:01:00.0
runtime-pm has the flexibility to define ill behaving devices and drivers to be blacklisted from power savings. You could simply try that out.

Set AUTOSUSPEND_RUNTIME_DEVID_BLACKLIST="0000:01:00.0" in /etc/laptop-mode/conf.d/runtime-pm.conf

And I'm marking this bug as done because it is now a mix of multiple different reports. If you run into problems, please open separate bugs as each users' scenario is different here.