wingo/fibers

High CPU usage on system time change

jetomit opened this issue · 7 comments

Since Guix upgraded to guile-fibers 1.3.1, shepherd hangs shortly after boot on systems without an RTC. I believe the problem comes from using get-internal-real-time in the guile-fibers timer wheel implementation. After NTP corrects the system time, this function returns a much larger value, and the CPU load (for one core) goes to 100%.

Profiling suggests the process spends its CPU time in timer-wheel-advance!, so I imagine it is trying to tick through a five-year time difference. I tried moving the system time forward manually by N days, which makes shepherd unresponsive (e.g. to herd status) for about N×5 seconds. I observed similar behavior with the example from the guile-fibers README.
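To illustrate why the stall would scale with the size of the jump, here is a toy model in Python. It is not the guile-fibers code: the function name and the tick rate are hypothetical, and it only assumes a wheel that advances one tick at a time until it catches up with the clock.

```python
# Toy model only: NOT the guile-fibers implementation.
def advance_tick_by_tick(current_tick, target_tick):
    """Advance one tick at a time; return (new_tick, number_of_steps)."""
    steps = 0
    while current_tick < target_tick:
        current_tick += 1  # a real wheel would also check this slot for due timers
        steps += 1
    return current_tick, steps


TICKS_PER_SECOND = 1000  # hypothetical tick rate, chosen for illustration

for seconds in (1, 10, 100):
    _, steps = advance_tick_by_tick(0, seconds * TICKS_PER_SECOND)
    print(seconds, steps)
```

In this model the work is linear in the gap, which would be consistent with the unresponsive period growing in proportion to N.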

Replacing all instances of (get-internal-real-time) with (clock-gettime 1) in guile-fibers, and reconfiguring the system with the patched package, fixes this problem. I think using a monotonic clock makes sense, but there is probably a cleaner / more portable way to do it.

Thanks!

Hi @jetomit!

Using CLOCK_MONOTONIC as you suggest seemed like the right choice to me so I started working on it. However, the API of (fibers timers) as well as schedule-task-at-time expect "internal time units"; changing timer-wheel to use CLOCK_MONOTONIC would affect those interfaces similarly, which is not acceptable.

Instead we should probably change timer-wheel-advance! to cope with large gaps.
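One way this could look, sketched as a toy single-level wheel in Python: however large the gap is, each slot is visited at most once per advance. The class, its methods, and the slot count are all hypothetical and far simpler than the real timer-wheel; this is only meant to show the shape of the idea, not a proposed patch.

```python
class ToyWheel:
    """Single-level timer wheel; a toy model, not guile-fibers' timer-wheel."""

    def __init__(self, slots=256):
        self.slots = slots
        self.now = 0  # current tick
        self.buckets = [[] for _ in range(slots)]

    def add(self, expiry_tick, callback):
        # Timers scheduled in the past are treated as due at the current tick.
        expiry_tick = max(expiry_tick, self.now)
        self.buckets[expiry_tick % self.slots].append((expiry_tick, callback))

    def advance(self, target_tick):
        """Fire every timer due by target_tick; return the number of slots visited."""
        if target_tick < self.now:
            return 0  # clock went backwards: nothing to do in this sketch
        # Cap the walk at one full revolution, whatever the size of the gap.
        span = min(target_tick - self.now, self.slots - 1)
        due = []
        for offset in range(span + 1):
            idx = (self.now + offset) % self.slots
            bucket = self.buckets[idx]
            due.extend(t for t in bucket if t[0] <= target_tick)
            self.buckets[idx] = [t for t in bucket if t[0] > target_tick]
        self.now = target_tick
        # Fire in expiry order so a large jump does not reorder timers.
        for _, callback in sorted(due, key=lambda t: t[0]):
            callback()
        return span + 1


fired = []
wheel = ToyWheel(slots=8)
wheel.add(3, lambda: fired.append("a"))
wheel.add(20, lambda: fired.append("b"))
# "Five years" of hypothetical 1 ms ticks in a single call.
visited = wheel.advance(5 * 365 * 24 * 3600 * 1000)
print(visited, fired)  # 8 ['a', 'b']
```

The cost of an advance is then bounded by the wheel size plus the number of due timers, rather than by the size of the time jump.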

@wingo, WDYT?

Thanks!

@jetomit Here's a proposed workaround on the Guix side: https://issues.guix.gnu.org/64966

> Here's a proposed workaround on the Guix side: https://issues.guix.gnu.org/64966

This would work for aarch64, but I also encounter this issue on armhf and x86_64 systems. This happens whenever system time is pushed forward by a significant amount (a day or more), either by ntpd or manually.

As I understand it, Guile’s internal time units depend only on the platform and are the same for all clock types. The bigger problem with using CLOCK_MONOTONIC might be that it doesn’t count the time the system spends suspended, which would probably break stuff.
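As a rough illustration in Python (Unix only), the sketch below reads several POSIX clocks and scales each reading to the same unit. The helper name and the UNITS_PER_SECOND value are hypothetical and not Guile's actual constant. On Linux, CLOCK_BOOTTIME behaves like CLOCK_MONOTONIC but also counts time spent suspended, so it is included where the platform provides it.

```python
import time


def to_internal_units(nanoseconds, units_per_second):
    """Scale a nanosecond clock reading to a given number of units per second."""
    return nanoseconds * units_per_second // 1_000_000_000


UNITS_PER_SECOND = 1_000_000  # hypothetical value, for illustration only

clocks = {
    "CLOCK_REALTIME": time.CLOCK_REALTIME,    # wall clock; jumps when the time is set
    "CLOCK_MONOTONIC": time.CLOCK_MONOTONIC,  # never jumps; on Linux excludes suspend
}
boottime = getattr(time, "CLOCK_BOOTTIME", None)  # Linux only
if boottime is not None:
    clocks["CLOCK_BOOTTIME"] = boottime  # monotonic, includes suspended time

for name, clock_id in clocks.items():
    reading_ns = time.clock_gettime_ns(clock_id)
    print(name, to_internal_units(reading_ns, UNITS_PER_SECOND))
```

The same scaling applies whichever clock is read, so the unit seen by callers need not change with the clock source.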

Another report of shepherd spinning once system time has changed: https://issues.guix.gnu.org/66684

@wingo Hello! Did you have a chance to look into that? I'd be happy to try and implement any suggestions you might have (I'd love to do that before Shepherd 1.0 is out).

Took a look at it but it requires a bit of concentration to not introduce bugs :) Do have a look if you like!

Just posting, for the record, another example "in the wild" of someone working around this issue: https://issues.guix.gnu.org/70892#3

Thanks for all your hard work! 😄