sched-ext/scx

kernel: Implement cached current time interface

Closed this issue · 12 comments

SCX schedulers tend to use bpf_ktime_get_ns() a lot. On x86, this eventually is serviced by rdtsc_ordered() which is the rdtscp instruction. The instruction is known to be expensive and has scalability issues on large machines when the CPUs are saturated (the cost of the instruction increases as machine gets saturated).

In most cases, we don't really care about nanosec accuracy that we're paying for. There can be a couple approaches in addressing this:

  1. Cache ktime_get_ns() result and provide a kfunc to access the cached time. The kernel should have reasonable cache invalidation points (e.g. at the start of dispatch and when rq lock is released during dispatch and so on).
  2. Provide interface to access timestamp which is not program ordered (ie. rdtsc in x86) along with helpers to compare and calculate delta between two timestamps. rdtsc is cheaper and doesn't have scalability issues that rdtscp has but it's unclear how this would map in other archs.

Considerations:

  • One property which is nice but expensive is guarnateeing that timestamps are correctly ordered - ie. the ordering between two timestamps always agrees with the ordering knowable to the program on the same CPU and across different CPUs. rdtscp guarantees this but that's why it's expensive.
  • For usage accounting purposes, we can do away with the above guarantee. We don't need that accurate time but dealing with occassional surprises (e.g. after - before underflowing) can be cause some headaches. Can probably be alleviated reasonably with a good set of helpers.

I have looked a bit, and this is what I am thinking.

Considering the discussion on the BPF mailing list (https://lore.kernel.org/bpf/CAEf4BzaBNNCYaf9a4oHsB2AzYyc6JCWXpHx6jk22Btv=UAgX4A@mail.gmail.com/), I think we can assume the following two (or something similar) new APIs:

  • bpf_get_hw_counter(), which eventually calls rdtsc in x86
  • an API (e.g., bpf_hw_counter_to_ns()), which converts a time stamp counter to nano-seconds (or vice versa)

Based on these two new APIs, we can provide two common utility functions at common.bpf.h abstracting the details of bpf_get_hw_counter() and bpf_ktime_get_ns().

  • scx_bpf_get_time(): returns a current time in a certain unit (timestamp counter or ns). It chooses bpf_get_hw_counter() over bpf_ktime_get_ns() if available for performance.
  • scx_bpf_time_to_ns(): converts the time returned from scx_bpf_get_time() to nano-seconds.

In my opinion, it would be quite difficult to handle the clock drift of rdtsc if its bound is unknown. So rather than handling the clock drift magically, it would be more straightforward to let scx scheduler handle it. I think in the scx schedulers' usage case, only thing to consider will be checking the new timestampe is smaller than an old timestamp.

What do you think, @htejun ?

One challenge is that rdtsc is reported to not scale well on large machines on intel sapphire rapids, so even it looks like we'd likely need caching whether we use rdtsc or rdtscp.

For reference, Chris Mason implemented a simple benchmark to tests TSC performance: https://github.com/masoncl/tscbench. The main problem being observed is rdtscp and rdtsc depending on the CPU type getting progressively slower on larger setups as CPUs get saturated.

Some thoughts on the problem:

  1. Why not make the API monotonic by default if there is no apparent good way of handling rdtsc drift, apart from abefore > after check? We could leave it to the scheduler if there was a design choice to be made about this, but that doesn't seem to be the case. Making the call monotonic seems more intuitive if the alternative is to always manually check.

  2. Is there a good reason to return both timestamp counters and ns? Possibly returning the raw counter forces the scheduler to consider the source of the timestamp when parsing it which doesn't seem desirable. I think a single both scx_bpf_get_time and scx_bpf_time_to_ns could get rolled into a single scx_bpf_get_time_ns that only returns ns.

  3. If rdtsc is problematic for some machines but works fine for others, we still need the option to return a cached value. We can add a flag in the call to explicitly specify the source of the value, but would it be worth it? Are there scenarios where reading rdtsc is superior to returning a cached value, or can we always use the latter?

Thoughts @multics69 @htejun ?

  1. Yeah, monotonic on each CPU makes sense to me but it'd be difficult to make it monotonic across the system. e.g. Given a sequence of events that are known to be ordered and happening across CPUs, can we guarantee that timestamp taken later on a different CPU is time-wise later? If we can't, then we still have the same ordering problem when comparing.
  2. I think it'd be better to always go for absolute time.
  3. If we can get caching right, we can probably always just use cached timestamp.

Thank you for the feedback @htejun and @etsal !

  1. I agree with @htejun 's point. Making a time stamp monotonic system-wide is a hard problem. I think it would be difficult (if not impossible) without a central cache hotspot.

  2. Exposing only absolute time would be less confusing.

  3. I will further develop ideas about the caching, if it would be possible efficiently and scalabily to cache a timestamp without creating cache hotspots.

BTW, @htejun -- do you have some tsc benchmark numbers on sapphire rapids? It would be great to understand how bad rdtsc is to know the budget we can use for caching.

I only heard results second-hand. IIRC, on sapphire rapids, rdtsc wasn't much better than rdtscp.

TSCBench results with a single-threaded execution (default)

2-socket Xeon (Intel(R) Xeon(R) Silver 4114 CPU @ 2.20GHz, 40 CPUs = 10x2x2)

rdtscp cmp rdtsc cmp rdtsc_lfence cmp clock_gettime cmp
low_ipc 0.96 0.98 0.96 0.95
high_ipc 0.97 0.99 0.99 0.98

1-socket AMD (AMD Ryzen 9 PRO 6950H, 16 CPUs = 8x1x2)

rdtscp cmp rdtsc cmp rdtsc_lfence cmp clock_gettime cmp
low_ipc 0.95 0.99 0.95 0.23
high_ipc 0.97 0.98 0.97 0.24

1-socket Intel Rapter Lake (13th Gen Intel(R) Core(TM) i7-13700H, 20 CPUs = 6x2+8)

rdtscp cmp rdtsc cmp rdtsc_lfence cmp clock_gettime cmp
low_ipc 0.93 0.95 0.93 0.93
high_ipc 0.98 0.97 0.99 0.95

SteamDeck OLED (AMD = 8 CPUs = 4x2)

rdtscp cmp rdtsc cmp rdtsc_lfence cmp clock_gettime cmp
low_ipc 0.94 0.97 0.95 0.93
high_ipc 0.95 0.97 0.95 0.93

TSCBench results with a multi-threaded execution (N=the max CPUs)

  • Ran tscbench multi-threaded extension at here
  • rdtsc_casis a rdtsc based lock-free caching protocol using a CAS instruction.

2-socket Xeon (Intel(R) Xeon(R) Silver 4114 CPU @ 2.20GHz, 40 CPUs = 10x2x2, nthreads=40)

rdtscp cmp rdtsc cmp rdtsc_lfence cmp rdtsc_cas
low_ipc 0.97 0.98 0.97 0.12
high_ipc 1.0 1.0 1.0 0.36

1-socket AMD (AMD Ryzen 9 PRO 6950H, 16 CPUs = 8x1x2, nthreads=16)

rdtscp cmp rdtsc cmp rdtsc_lfence cmp rdtsc_cas
low_ipc 0.98 1.0 0.99 0.68
high_ipc 1.0 1.0 1.0 0.92

The observations so far can be summarized as follows:

  • rdtsc is slightly faster (~ 2%) than rdtscp.
  • At least, for the tested machines, both scale pretty well with an increasing number of CPUs.
  • Touching one shared cacheline (rdtsc_cas) results in the scalability disaster because the cache coherence is much worse in scalability than rdtsc family.