heftig/rtkit

Canary thread "starves" after s2idle

heftig opened this issue · 6 comments

A Dell XPS 13 2-in-1 (7390) defaults to using suspend-to-idle (s2idle) for mem_sleep. After resume, rtkit thinks the canary thread starved and demotes all the realtime threads.

I can reproduce this on a Dell XPS 9380 on Fedora 34. I get audio glitches under load because all pipewire processes lose SCHED_RR on resume.

What's the best way to fix this? Can we signal the watchdog thread when going to suspend and then again on resume so it knows to ignore the lack of canary cheeps while suspended?
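To make the idea concrete, here is a minimal sketch of a watchdog that is told about suspend and resume and does not count the suspended interval as starvation. This is not rtkit's actual code (rtkit is written in C); the class name, method names, and the 10-second threshold are all hypothetical, and timestamps are passed in explicitly so the logic is easy to follow.

```python
from typing import Optional


class CanaryWatchdog:
    """Hypothetical watchdog that ignores missing cheeps while suspended."""

    def __init__(self, starvation_threshold: float) -> None:
        # Illustrative threshold in seconds; not taken from rtkit.
        self.starvation_threshold = starvation_threshold
        self.last_cheep: Optional[float] = None
        self.suspended = False

    def cheep(self, now: float) -> None:
        """Record that the canary thread got to run at time `now`."""
        self.last_cheep = now

    def on_suspend(self) -> None:
        """Called just before the system goes to sleep."""
        self.suspended = True

    def on_resume(self, now: float) -> None:
        """Called right after resume; restart the starvation timer."""
        self.suspended = False
        self.last_cheep = now

    def is_starved(self, now: float) -> bool:
        """True only if the canary has been silent too long while awake."""
        if self.suspended or self.last_cheep is None:
            return False
        return now - self.last_cheep > self.starvation_threshold
```

With this shape, a long gap that spans a suspend never triggers a demotion, while a real stall after resume is still detected once the threshold passes again.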

It's a bit confusing to me that despite using CLOCK_MONOTONIC, the clock is proceeding while suspended. That shouldn't be happening, right?
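For anyone who wants to check this on their own machine: per clock_gettime(2), CLOCK_MONOTONIC is documented not to count time the system is suspended, while CLOCK_BOOTTIME does. A rough sketch (Linux only; the function name is made up) that compares the two:

```python
import time


def suspended_seconds() -> float:
    """Approximate time spent suspended since boot (Linux only).

    CLOCK_BOOTTIME includes suspended time and CLOCK_MONOTONIC is
    documented not to, so their difference estimates the time asleep.
    """
    mono = time.clock_gettime(time.CLOCK_MONOTONIC)
    boot = time.clock_gettime(time.CLOCK_BOOTTIME)
    return boot - mono


if __name__ == "__main__":
    print(f"suspended for roughly {suspended_seconds():.1f} s since boot")
```

If the value grows by about the length of each suspend, CLOCK_MONOTONIC is standing still during sleep as documented, and the jump the watchdog sees would have to come from somewhere else.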

Edit: to clarify, I use mem_sleep_default=deep, and /sys/power/mem_sleep is deep, so this is happening with S2RAM in my case, not S2idle.

This also happens with S3 sleep/suspend, although it does eventually allow RT threads again after about 20 minutes.

I don't want to send spam, but this bug report provides some context for the same issue. The ticket was closed because Fedora 21 reached EOL, not because it was solved. https://bugzilla.redhat.com/show_bug.cgi?id=688282
The ticket was opened 12 years ago.

Comment #5 by Lennart Poettering suggested that resolving this requires some notification from the kernel about the suspend.

But I think there could be other workarounds. As I understand it, the root issue is that the canary thread sleeps through the suspend, rtkit sees a big difference in time after resume, and it assumes the canary thread was blocked for that whole time.

Can't we just... you know... use systemd? That systemd, by Lennart? Split rtkit into two parts? One always running (the main daemon) and a second one watching for thread starvation... and use systemd to stop this watching process before the suspend target and start it again after resume?

Or, before demoting everything, can't we just... "ping" that thread to see if it's really starved, or whether it was just a one-time issue?
Or use current CPU load to see if the starvation is really an issue?

Or ignore the starvation if the time between the thread's responses is greater than, say, 30 minutes?
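A rough sketch of that last heuristic, assuming two cutoffs: a hypothetical 10-second starvation threshold (not rtkit's real value) and the 30-minute "too long to be starvation" limit suggested above. The function name and return labels are made up for illustration.

```python
def classify_gap(gap_s: float,
                 starvation_s: float = 10.0,
                 implausible_s: float = 30 * 60.0) -> str:
    """Classify the time between two canary responses.

    "ok"      - the canary answered in time
    "starved" - silent longer than the starvation threshold
    "ignore"  - silent so long that a suspend is the likelier cause
    """
    if gap_s <= starvation_s:
        return "ok"
    if gap_s >= implausible_s:
        return "ignore"
    return "starved"
```

The obvious weakness is that a suspend shorter than 30 minutes still lands in the "starved" bucket, so this only helps for long sleeps.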

As I see it, the purpose of the canary thread is to see if the system is stable enough and rtkit can be at ease that the system can actually perform well if all those threads are prioritized.
But can't it just launch two prioritized canary threads and see if they are both running in tandem correctly, and if both are not behaving, assume something else happened and do nothing?
Because if those two threads can't do their job (aren't realtime'd enough), then how could rtkit-daemon itself be "okay" under such conditions? It just couldn't, or it would have to be VERY lucky to function at all.

Is logind's dbus 'PrepareForSleep()' signal (https://www.freedesktop.org/software/systemd/man/org.freedesktop.login1.html) not good enough for this? We might not be able to assume it's always there because it's part of systemd, but when it is there it seems like the canonical way to detect such a condition.
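For reference, PrepareForSleep carries a single boolean: true right before the system goes to sleep, false right after it wakes up. A minimal sketch of a handler built around that argument, with the actual D-Bus subscription (the org.freedesktop.login1.Manager interface on /org/freedesktop/login1) left out because it depends on which D-Bus binding is used; the function and callback names are hypothetical.

```python
from typing import Callable


def make_prepare_for_sleep_handler(
    on_suspend: Callable[[], None],
    on_resume: Callable[[], None],
) -> Callable[[bool], None]:
    """Build a callback for logind's PrepareForSleep(b start) signal.

    `start` is True just before sleep and False just after resume.
    """

    def handler(start: bool) -> None:
        if start:
            on_suspend()   # e.g. tell the watchdog to stop counting
        else:
            on_resume()    # e.g. reset the canary timestamp
    return handler
```

The returned handler would be registered with whatever D-Bus library the daemon already uses; when logind isn't present, nothing fires and the existing behavior stays as it is.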

PR up to fix this using the dbus notification mechanism built into systemd logind (when present).

Any progress on this?