facebookincubator/below

Large number of clock_gettime syscalls

hodgesds opened this issue · 7 comments

I noticed a few interesting things when running below in live mode. First, there are roughly 3-4 clock_gettime syscalls per epoll_wait.

strace -f -p `pgrep below`
...
[pid 10545] epoll_wait(7, [], 3, 0)     = 0
[pid 10545] clock_gettime(CLOCK_MONOTONIC, {tv_sec=1030, tv_nsec=824761942}) = 0
[pid 10545] clock_gettime(CLOCK_MONOTONIC, {tv_sec=1030, tv_nsec=824806781}) = 0
[pid 10545] clock_gettime(CLOCK_MONOTONIC, {tv_sec=1030, tv_nsec=824850152}) = 0
[pid 10545] clock_gettime(CLOCK_MONOTONIC, {tv_sec=1030, tv_nsec=824894012}) = 0
[pid 10545] epoll_wait(7, [], 3, 0)     = 0
[pid 10545] clock_gettime(CLOCK_MONOTONIC, {tv_sec=1030, tv_nsec=824976495}) = 0
[pid 10545] clock_gettime(CLOCK_MONOTONIC, {tv_sec=1030, tv_nsec=825019028}) = 0
[pid 10545] clock_gettime(CLOCK_MONOTONIC, {tv_sec=1030, tv_nsec=825062260}) = 0
[pid 10545] clock_gettime(CLOCK_MONOTONIC, {tv_sec=1030, tv_nsec=825104165}) = 0
[pid 10545] epoll_wait(7, [], 3, 0)     = 0

A quick-and-dirty bpftrace estimate shows ~20k total syscalls/sec:

 bpftrace -e 'tracepoint:raw_syscalls:sys_enter /comm == "below"/ { @syscalls = count(); }
    interval:s:1 { print(@syscalls); clear(@syscalls); }'
Attaching 2 probes...
@syscalls: 20399
@syscalls: 19499
@syscalls: 18809
@syscalls: 19378
@syscalls: 27254
@syscalls: 20061
@syscalls: 19144
@syscalls: 18824

Compared to clock_gettime syscalls (~15k syscalls/sec):

bpftrace -e 'tracepoint:syscalls:sys_enter_clock_gettime /comm == "below"/ { @syscalls = count(); }
    interval:s:1 { print(@syscalls); clear(@syscalls); }'
Attaching 2 probes...
@syscalls: 15518
@syscalls: 15510
@syscalls: 17648
@syscalls: 14906
@syscalls: 15070
@syscalls: 15713

Not that there's anything wrong with calling clock_gettime, but when I run top/htop alongside below I see it take anywhere from ~7% to ~12% CPU. From digging around the code, I'm guessing the store crate is where many of these calls come from:

rg -i 'now\(' | cut -d ':' -f 1 | sort | uniq -c
      1 below/common/src/cliutil.rs
     32 below/common/src/dateutil.rs
      2 below/model/src/collector.rs
      7 below/src/main.rs
      4 below/src/test.rs
      1 below/store/src/advance.rs
      9 below/store/src/cursor.rs
     11 below/store/src/lib.rs
      1 below/view/src/lib.rs

Hmm, yeah, this isn't ideal. It seems we call some variant of ::now() pretty liberally across the codebase, and those calls presumably boil down to clock_gettime. We should see if we can improve this.
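
For example (rough sketch only, not how the code is actually laid out today; the Sample type and field names below are made up), collectors and views could read the clock once per cycle and pass that timestamp down instead of calling now() in every helper:

use std::time::SystemTime;

// Hypothetical per-cycle sample; field names are illustrative only.
struct Sample {
    collected_at: SystemTime,
    // ... real fields elided
}

fn collect_sample() -> Sample {
    // Single clock read per collection cycle.
    Sample {
        collected_at: SystemTime::now(),
    }
}

fn render(sample: &Sample) {
    // Helpers reuse the cached timestamp instead of calling now() again.
    let _ts = sample.collected_at;
}

fn main() {
    let sample = collect_sample();
    render(&sample);
}

That wouldn't make an individual clock_gettime any cheaper, but it would cut how many we issue per refresh.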

@hodgesds are you running below in live view or daemon record mode here?

Live view. Here's some more data:

CPU/kernel:

$ lscpu
Architecture:            x86_64
  CPU op-mode(s):        32-bit, 64-bit
  Address sizes:         48 bits physical, 48 bits virtual
  Byte Order:            Little Endian
CPU(s):                  16
  On-line CPU(s) list:   0-15
Vendor ID:               AuthenticAMD
  BIOS Vendor ID:        Advanced Micro Devices, Inc.
  Model name:            AMD Ryzen 7 PRO 4750U with Radeon Graphics
    BIOS Model name:     AMD Ryzen 7 PRO 4750U with Radeon Graphics
    CPU family:          23
    Model:               96
    Thread(s) per core:  2
    Core(s) per socket:  8
    Socket(s):           1
    Stepping:            1
    Frequency boost:     enabled
    CPU max MHz:         1700.0000
    CPU min MHz:         1400.0000
    BogoMIPS:            3393.52
    Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm con
                         stant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f
                         16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpex
                         t perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 cqm rdt_a rdseed adx smap clflushopt c
                         lwb sha_ni xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr rdpru wbnoinvd arat npt lbrv svm
                         _lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif v_spec_ctrl umip rdpid overflow_reco
                         v succor smca
Virtualization features:
  Virtualization:        AMD-V
Caches (sum of all):
  L1d:                   256 KiB (8 instances)
  L1i:                   256 KiB (8 instances)
  L2:                    4 MiB (8 instances)
  L3:                    8 MiB (2 instances)
NUMA:
  NUMA node(s):          1
  NUMA node0 CPU(s):     0-15
Vulnerabilities:
  Itlb multihit:         Not affected
  L1tf:                  Not affected
  Mds:                   Not affected
  Meltdown:              Not affected
  Spec store bypass:     Mitigation; Speculative Store Bypass disabled via prctl
  Spectre v1:            Mitigation; usercopy/swapgs barriers and __user pointer sanitization
  Spectre v2:            Mitigation; Full AMD retpoline, IBPB conditional, IBRS_FW, STIBP conditional, RSB filling
  Srbds:                 Not affected
  Tsx async abort:       Not affected
$ uname -a
Linux l1441l 5.16.0-rc2-x86_64 #5 SMP PREEMPT Sun Dec 12 21:32:14 EST 2021 x86_64 AMD Ryzen 7 PRO 4750U with Radeon Graphics AuthenticAMD GNU/Linux

dmesg of clocksource-related events:

$ dmesg -HkuT | grep -i clock
[Wed Dec 22 08:37:20 2021] clocksource: refined-jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 1910969940391419 ns
[Wed Dec 22 08:37:20 2021] clocksource: hpet: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 133484873504 ns
[Wed Dec 22 08:37:20 2021] clocksource: tsc-early: mask: 0xffffffffffffffff max_cycles: 0x187534688da, max_idle_ns: 440795223786 ns
[Wed Dec 22 08:37:20 2021] ..... CPU clock speed is 1696.0769 MHz.
[Wed Dec 22 08:37:20 2021] ..... host bus clock speed is 99.0809 MHz.
[Wed Dec 22 08:37:20 2021] clocksource: jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 1911260446275000 ns
[Wed Dec 22 08:37:20 2021] acpi PNP0A08:00: _OSC: OS supports [ExtendedConfig ASPM ClockPM Segments MSI HPX-Type3]
[Wed Dec 22 08:37:20 2021] PTP clock support registered
[Wed Dec 22 08:37:20 2021] clocksource: Switched to clocksource tsc-early
[Wed Dec 22 08:37:20 2021] clocksource: acpi_pm: mask: 0xffffff max_cycles: 0xffffff, max_idle_ns: 2085701024 ns
[Wed Dec 22 08:37:21 2021] clocksource: timekeeping watchdog on CPU1: Marking clocksource 'tsc-early' as unstable because the skew is too large:
[Wed Dec 22 08:37:21 2021] clocksource:                       'hpet' wd_nsec: 498437371 wd_now: e018d6 wd_last: 73330a mask: ffffffff
[Wed Dec 22 08:37:21 2021] clocksource:                       'tsc-early' cs_nsec: 778421340 cs_now: a1b316e5d cs_last: 9cc77b002 mask: ffffffffffffffff
[Wed Dec 22 08:37:21 2021] clocksource:                       'tsc-early' is current clocksource.
[Wed Dec 22 08:37:21 2021] tsc: Marking TSC unstable due to clocksource watchdog
[Wed Dec 22 08:37:21 2021] clocksource: Switched to clocksource hpet
[Wed Dec 22 08:37:21 2021] Unstable clock detected, switching default tracing clock to 'global'
                           If you want to keep using the local clock, then add:
                             'trace_clock=local'

Clock source info:

$ cat /sys/devices/system/clocksource/clocksource0/{available,current}_clocksource
hpet acpi_pm
hpet

Some interesting data when switching clock sources:

$ cat /sys/devices/system/clocksource/clocksource0/current_clocksource
hpet
$ bpftrace -e 'tracepoint:syscalls:sys_enter_clock_gettime /comm == "below"/ { @syscalls = count(); }
    interval:s:1 { print(@syscalls); clear(@syscalls); }'
Attaching 2 probes...
@syscalls: 17429
@syscalls: 16526
@syscalls: 17092
@syscalls: 15773
@syscalls: 15430
$ echo acpi_pm > /sys/devices/system/clocksource/clocksource0/current_clocksource
$ cat /sys/devices/system/clocksource/clocksource0/current_clocksource
acpi_pm
$ bpftrace -e 'tracepoint:syscalls:sys_enter_clock_gettime /comm == "below"/ { @syscalls = count(); }
    interval:s:1 { print(@syscalls); clear(@syscalls); }'
Attaching 2 probes...
@syscalls: 11498
@syscalls: 11862
@syscalls: 11918
@syscalls: 13113
@syscalls: 11938
@syscalls: 11978

It seems like with the acpi_pm clocksource there are fewer clock_gettime syscalls per second.

I can confirm from dmesg that the clocksource is being changed as well:

$ dmesg -HkuT | grep clocksource | tail -n 5
[Wed Dec 22 08:37:21 2021] clocksource: Switched to clocksource hpet
[Wed Dec 22 19:23:30 2021] clocksource: Switched to clocksource acpi_pm
[Wed Dec 22 19:24:57 2021] clocksource: Switched to clocksource hpet
[Wed Dec 22 19:27:42 2021] clocksource: Switched to clocksource acpi_pm
[Wed Dec 22 19:30:54 2021] clocksource: Switched to clocksource hpet

I ran the relevant timens.c kernel selftest as a sanity check:

/usr/src/linux/tools/testing/selftests/timens $ ./timens
1..10
ok 1 Passed for CLOCK_BOOTTIME (syscall)
ok 2 Passed for CLOCK_BOOTTIME (vdso)
ok 3 Passed for CLOCK_BOOTTIME_ALARM (syscall)
ok 4 Passed for CLOCK_BOOTTIME_ALARM (vdso)
ok 5 Passed for CLOCK_MONOTONIC (syscall)
ok 6 Passed for CLOCK_MONOTONIC (vdso)
ok 7 Passed for CLOCK_MONOTONIC_COARSE (syscall)
ok 8 Passed for CLOCK_MONOTONIC_COARSE (vdso)
ok 9 Passed for CLOCK_MONOTONIC_RAW (syscall)
ok 10 Passed for CLOCK_MONOTONIC_RAW (vdso)
# Totals: pass:10 fail:0 xfail:0 xpass:0 skip:0 error:0

Some more perf trace data:

$ perf trace --sched -sv  -p `pgrep below`
Using CPUID AuthenticAMD-23-60-1
mmap size 528384B
 Summary of events:

 below (26201), 65586 events, 79.1%, 746.654 msec

   syscall            calls  errors  total       min       avg       max       stddev
                                     (msec)    (msec)    (msec)    (msec)        (%)
   --------------- --------  ------ -------- --------- --------- ---------     ------
   clock_nanosleep      237      0  7169.494     0.000    30.251    30.664      0.42%
   clock_gettime      25586      0   117.166     0.000     0.005     0.042      0.25%
   epoll_wait          6201      0    21.078     0.000     0.003     0.048      0.66%
   openat                58      0     1.379     0.004     0.024     0.050     10.38%
   ioctl                 58      0     0.225     0.001     0.004     0.008      8.30%
   close                 58      0     0.165     0.001     0.003     0.005      6.88%
   write                  2      0     0.017     0.007     0.009     0.010     14.34%


 below (26243), 17354 events, 20.9%, 54.356 msec

   syscall            calls  errors  total       min       avg       max       stddev
                                     (msec)    (msec)    (msec)    (msec)        (%)
   --------------- --------  ------ -------- --------- --------- ---------     ------
   poll                   8      0    21.783     0.022     2.723    13.213     59.53%
   read                2256      6    15.014     0.001     0.007     0.117      2.56%
   openat              2127    314     7.403     0.002     0.003     0.067      1.46%
   futex                697    105     4.389     0.000     0.006     0.066      4.81%
   close               1819      0     2.996     0.001     0.002     0.032      1.46%
   clock_gettime        718      0     1.701     0.001     0.002     0.011      1.29%
   statx                419      0     1.209     0.002     0.003     0.009      1.30%
   readlink             361    302     1.191     0.002     0.003     0.015      1.94%
   clone                  7      0     0.834     0.031     0.119     0.327     40.91%
   getdents64            20      0     0.643     0.001     0.032     0.420     64.12%
   rt_sigprocmask        14      0     0.090     0.001     0.006     0.031     39.21%
   newfstatat            18      0     0.076     0.002     0.004     0.018     21.80%
   pipe2                  4      0     0.046     0.003     0.011     0.025     42.03%
   wait4                  2      0     0.043     0.021     0.021     0.022      1.10%
   fcntl                 18      0     0.041     0.001     0.002     0.004      9.24%
   readlinkat             9      0     0.029     0.003     0.003     0.004      6.20%
   prlimit64             12      0     0.023     0.001     0.002     0.003     12.25%
   mmap                   2      0     0.020     0.009     0.010     0.012     12.98%
   munmap                 2      0     0.020     0.006     0.010     0.014     41.34%
   ioctl                  6      0     0.018     0.001     0.003     0.005     18.35%
   fstatfs                1      0     0.007     0.007     0.007     0.007      0.00%
   uname                  1      0     0.002     0.002     0.002     0.002      0.00%

After doing some more digging, it seems the chrono crate doesn't use the vDSO (issue), whereas the standard library should just call into glibc.
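
A quick way to check which path std's clock reads take (just a throwaway probe, not part of below; and since the TSC got marked unstable here and the box fell back to hpet/acpi_pm, I'd expect the vDSO to punt to the real syscall regardless, as far as I understand):

// probe.rs: run as `strace -c -e trace=clock_gettime ./probe`.
// If clock_gettime is served from the vDSO, strace should report
// almost no calls; if it falls back, the count will be ~2 million.
fn main() {
    let mut acc = std::time::Duration::ZERO;
    for _ in 0..1_000_000 {
        let t = std::time::Instant::now(); // one clock read
        acc += t.elapsed();                // and one more
    }
    println!("{acc:?}");
}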

Yeah, not using the vDSO is pretty silly. We either need to make the chrono crate do it or use the standard library implementation.
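
Something like this would work for the latter (sketch only, assuming we keep chrono's types at the call sites): take the timestamp through std, which goes via glibc's clock_gettime (and the vDSO when the clocksource supports it), then convert it, since chrono implements From<SystemTime> for DateTime<Utc>:

use std::time::SystemTime;

// Requires the chrono crate in Cargo.toml.
use chrono::{DateTime, Utc};

fn now_utc() -> DateTime<Utc> {
    // Only SystemTime::now() touches the clock; the conversion is pure math.
    DateTime::<Utc>::from(SystemTime::now())
}

fn main() {
    println!("{}", now_utc());
}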

@hodgesds Are you installing from an RPM or building from source? If an RPM, is it Fedora? If so, which version of Fedora?

According to chronotope/chrono@5115c6f#diff-3e4364fa6f0aed7a40e68aa12a41235e57f45378bcd8f0094496a7c1d8be0ed0, chrono should be hitting the standard library here, so I would guess the issue is an install that uses an older version of the chrono crate.

Building from source at HEAD (commit 3b6554f4c4a95d5f7c74cdd7ee6c05bfed47ffe1).

I do see the vDSO being (dynamically) linked:

$ ldd target/release/below
        linux-vdso.so.1 (0x00007fffc6f71000)
        libelf.so.1 => /usr/lib64/libelf.so.1 (0x00007ffa3591c000)
        libz.so.1 => /lib64/libz.so.1 (0x00007ffa35902000)
        libgcc_s.so.1 => /usr/lib/gcc/x86_64-pc-linux-gnu/11.2.0/libgcc_s.so.1 (0x00007ffa358e7000)
        librt.so.1 => /lib64/librt.so.1 (0x00007ffa358dd000)
        libpthread.so.0 => /lib64/libpthread.so.0 (0x00007ffa358bd000)
        libm.so.6 => /lib64/libm.so.6 (0x00007ffa35788000)
        libdl.so.2 => /lib64/libdl.so.2 (0x00007ffa35780000)
        libc.so.6 => /lib64/libc.so.6 (0x00007ffa355c0000)
        /lib64/ld-linux-x86-64.so.2 (0x00007ffa36156000)
$ objdump -rR target/release/below | grep -i time
000000000087aea8 R_X86_64_GLOB_DAT  localtime_r@GLIBC_2.2.5
000000000087b6e8 R_X86_64_GLOB_DAT  clock_gettime@GLIBC_2.17
000000000087b8c0 R_X86_64_GLOB_DAT  mktime@GLIBC_2.2.5
000000000087df50 R_X86_64_GLOB_DAT  timegm@GLIBC_2.2.5

OK, going to close this since I think building from source behaves as you'd expect (vDSO calls are cheap). This is probably an issue with RPMs from some older version of Fedora that ship an older chrono crate.