jitsi/jitsi-videobridge

Excessive CPU time spent in clock_gettime syscalls under moderate load

jbg opened this issue · 3 comments

jbg commented

Description

Under load testing, we noticed that current JVB master spends most of its CPU time in kernel space, which I found surprising since I expected it to be mostly userland and interrupts. Attaching strace to the busiest thread (thread name: org.jitsi.utils) gave this report:

# strace -c -p 10018
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 96.62   22.894570         422     54182           clock_gettime
  3.35    0.794870        1906       417       206 futex
  0.02    0.004398        4398         1         1 restart_syscall
  0.00    0.000704         234         3           rt_sigprocmask
  0.00    0.000502          71         7           gettid
  0.00    0.000355         118         3           rt_sigreturn
------ ----------- ----------- --------- --------- ----------------
100.00   23.695399         433     54613       207 total

It's calling clock_gettime 2286 times per second and spending almost all of its time inside that syscall. This is with just 10 active video senders. Multiple Global CPU pool threads are also spending most of their time inside clock_gettime. On this lab system, clock_gettime system calls account for most of the JVB's CPU usage and cause it to hit 100% CPU much earlier than it used to as we increase the load.

A separate run with async-profiler shows:

          ns  percent  samples  top
  ----------  -------  -------  ---
 99323573241   82.43%     9573  read_hpet_[k]

Unfortunately we have an issue with JDK debug symbols, so we can't get the Java part of the relevant stack traces; they all look like this:

  [ 0] read_hpet_[k]
  [ 1] ktime_get_ts64_[k]
  [ 2] posix_get_monotonic_timespec_[k]
  [ 3] __x64_sys_clock_gettime_[k]
  [ 4] do_syscall_64_[k]
  [ 5] entry_SYSCALL_64_after_hwframe_[k]
  [ 6] __vdso_clock_gettime
  [ 7] __clock_gettime_2
  [ 8] [unknown_Java]

We'll resolve the JDK debug symbols problem and update this issue. We'll also try to bisect this against the JVB repository to find out when the issue began.

This could be caused by a high rate of calls to System.currentTimeMillis(), which uses clock_gettime(CLOCK_REALTIME) internally. A lot has been written in the past about how (relatively) slow this method is, at least on Linux when the clocksource is the HPET, and there's a long-standing bug report asking for it to be switched to the much faster clock_gettime(CLOCK_REALTIME_COARSE), which still provides sufficient precision for currentTimeMillis. That hasn't had much traction, so the workaround is basically to call it less often, or to switch to a different clocksource such as TSC (which we may experiment with).
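For anyone who wants to measure the per-call cost on their own hardware, a crude (non-JMH) microbenchmark like the sketch below is enough to show the difference; the class name and iteration counts are just illustrative. On a machine where the HPET is the clocksource, each call falls through the vDSO into a real syscall (as the stack trace above shows), whereas with TSC it can usually be answered entirely in userspace.

// Rough sketch, not a rigorous benchmark: estimates the average cost of a
// System.currentTimeMillis() / System.nanoTime() call on the current host.
public class ClockBench {
    public static void main(String[] args) {
        final int iterations = 5_000_000;
        long sink = 0;

        // Warm-up so the JIT compiles the loops before we time them.
        for (int i = 0; i < 1_000_000; i++) {
            sink += System.currentTimeMillis();
            sink += System.nanoTime();
        }

        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            sink += System.currentTimeMillis();
        }
        long millisCost = (System.nanoTime() - start) / iterations;

        start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            sink += System.nanoTime();
        }
        long nanoCost = (System.nanoTime() - start) / iterations;

        System.out.println("currentTimeMillis: ~" + millisCost + " ns/call");
        System.out.println("nanoTime:          ~" + nanoCost + " ns/call");
        System.out.println("(sink: " + sink + ")"); // keeps the JIT from eliding the calls
    }
}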

It's also plausible that this is something environmental. Seeing clock_gettime and futex at the top of a profile reminded me of the Linux leap-second futex bug, but that was fixed years ago. It would be much appreciated if anyone else could try to reproduce these results.

Steps to reproduce

  1. Put 10 or so people in a conference.
  2. Observe high System (sy) CPU usage.
  3. Run top -H and observe the PID of the busiest JVB thread.
  4. Run strace -c -p $PID and interrupt it after 10-20 seconds.
  5. Observe that clock_gettime consumes most of the system CPU time.

Environment details

Xeon E-2278G
Linux 5.14.9
OpenJDK 16.0.2 (tested a few different OpenJDK versions down to 8 and up to 17 with similar results)

Do you have any additional logging (like time series) enabled? Can you attach a Java profiler?

I cannot reproduce this in our environment: sy stays relatively low (~5%) and clock_gettime doesn't show up at all in the strace output (I tried two of the CPU pool threads and a few of the IO pool threads).

jbg commented

I can confirm that this is, at least in part, an environmental issue, but it's one that might affect others as well.

Due to either a BIOS bug or a Linux kernel bug, our lab server's TSC clocksource was being marked unstable, and the HPET was being used as the clocksource instead. clock_gettime(CLOCK_REALTIME) (and thus System.currentTimeMillis()) is about 20x slower when the clocksource is HPET vs. when it is TSC.

Booting the system with hpet=disable causes TSC to be selected as clocksource and the problem disappears; CPU usage drops back to normal levels.

So, anyone seeing excessive CPU usage on their JVBs, run:

cat /sys/devices/system/clocksource/clocksource0/current_clocksource

If the output is hpet, try booting with hpet=disable added to the kernel command line.

Incidentally, even after the fix, once we push a few hundred participants, clock_gettime is still quite significant in the profiling results. If there's an easy way to reduce the number of System.currentTimeMillis() and similar timekeeping calls, it could be a win.
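For what it's worth, one pattern some other network servers use to cut timekeeping overhead is a cached coarse clock: a single scheduler thread refreshes a volatile timestamp every few milliseconds, and hot paths read that field instead of making a clock call per packet. Precision drops to the refresh interval, which is often fine for stats and timeout bookkeeping. A minimal hypothetical sketch (the CachedClock name and 5 ms interval are mine, not anything in JVB):

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical helper: one daemon thread refreshes the timestamp so that
// readers never issue a clock_gettime syscall themselves.
public final class CachedClock {
    private static volatile long nowMillis = System.currentTimeMillis();

    private static final ScheduledExecutorService updater =
            Executors.newSingleThreadScheduledExecutor(r -> {
                Thread t = new Thread(r, "cached-clock");
                t.setDaemon(true);
                return t;
            });

    static {
        // Refresh every 5 ms; tune to the precision callers actually need.
        updater.scheduleAtFixedRate(
                () -> nowMillis = System.currentTimeMillis(), 5, 5, TimeUnit.MILLISECONDS);
    }

    /** Millisecond timestamp, accurate to roughly the refresh interval. */
    public static long currentTimeMillis() {
        return nowMillis;
    }

    private CachedClock() {}
}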

🍿 I found this fascinating to read.