Poor performance in `kvstore` benchmark on Linux
This is a complicated problem; it happens only on Linux (not on OSX), and the usual (time) profiling doesn't reveal any obvious culprits. On this ticket we will collect some of the measurements, but the basic demonstration of the problem is that

```
cabal run grapesy-kvstore
```

reports 11.483 RPCs/sec, which should instead be as close to 25 RPCs/sec as possible (25 is the theoretical max; Java does 24.2), and

```
cabal run grapesy-kvstore -- --no-work-simulation
```

reports a measly 13.9 RPCs/sec, versus 4700 RPCs/sec for the Java benchmark (roughly 2.5 orders of magnitude more). Again, on OSX this does not happen (`grapesy` is in fact roughly 30% faster than the Java benchmark there); we are getting unexpected delays on Linux somewhere.
Measuring performance with `--no-work-simulation` (thereby essentially measuring the raw performance) versus GHC version:
| GHC version | Performance |
|---|---|
| 8.10.7 | 13.900 RPCs/sec |
| 9.2.8 | 13.867 RPCs/sec |
| 9.4.8 | 13.683 RPCs/sec |
| 9.6.4 | 14.083 RPCs/sec |
| 9.8.2 | 13.783 RPCs/sec |
| 9.10.1 | 13.817 RPCs/sec |
These tests were run on a machine with other things running as well, so we need to take these results with a grain of salt, but the conclusion seems safe that the problem exists for all recent versions of GHC (the result should be thousands of RPCs/sec, not tens).
Running in profiling mode

```
cabal run grapesy-kvstore -- --no-work-simulation +RTS -p -s
```

we see
```
  INIT    time    0.001s  (  0.001s elapsed)
  MUT     time    1.426s  ( 60.981s elapsed)
  GC      time    0.081s  (  0.081s elapsed)
  RP      time    0.000s  (  0.000s elapsed)
  PROF    time    0.000s  (  0.000s elapsed)
  EXIT    time    0.000s  (  0.000s elapsed)
  Total   time    1.508s  ( 61.063s elapsed)
```
i.e., only ~1.5s of CPU time against ~61s of wall-clock time: we're barely doing anything of interest. The time profile shows that the time we do spend is spent in reasonable places:
```
$wthrowSocketErrorWaitRead   Network.Socket.Internal    <no location info>                          19.0   1.3
$wthrowSocketErrorWaitWrite  Network.Socket.Internal    <no location info>                          12.2   0.4
MAIN                         MAIN                       <built-in>                                  11.0   0.5
$wnextBytes                  KVStore.Util.RandomGen     <no location info>                           7.6  47.4
$wframeSender                Network.HTTP2.H2.Sender    Network/HTTP2/H2/Sender.hs:74:1-11           5.3   2.6
$wframeReceiver              Network.HTTP2.H2.Receiver  Network/HTTP2/H2/Receiver.hs:53:1-13         4.2   2.7
parseRequestHeaders          Network.GRPC.Spec.Request  src/Network/GRPC/Spec/Request.hs:252:1-19    3.0   0.4
```
but again, this is out of a total time budget that amounts to basically nothing at all.
When we analyse this with `ghc-events-util`

```
cabal run ghc-events-util -- show-delta grapesy-kvstore.eventlog --match 'CLIENT|HANDLER'
```

the typical pattern we see is
```
 40.90ms   0.01ms  3193512347  cap 0  CLIENT  start DELETE
  0.00ms   0.00ms  3234416211  cap 0  HANDLER start DELETE
 41.23ms   0.00ms  3234418221  cap 0  HANDLER stop  DELETE
  0.01ms   0.01ms  3275643492  cap 0  CLIENT  stop  DELETE
```
where it takes ~40ms for the message to make it from the client to the handler, the handler then spends no time at all processing it, it takes another ~40ms for the response to make it back from the handler to the client, and the client then spends no time at all dealing with the response. So something is going wrong with sending messages (which also explains the `blocked in STM retry` that we saw in ThreadScope at the end of each burst of activity). Occasionally, however, we see something like
```
  0.26ms   0.01ms  3275648912  cap 0  CLIENT  start DELETE
  0.00ms   0.00ms  3275910900  cap 0  HANDLER start DELETE
 40.57ms   0.00ms  3275911980  cap 0  HANDLER stop  DELETE
  0.00ms   0.00ms  3316486546  cap 0  CLIENT  stop  DELETE
```
where for whatever reason we don't get that 40ms delay. (Again, none of these delays are happening on OSX.)
We get that same kind of "usually a ~40ms delay, but not quite always" pattern on the handler side as well:
```
 40.77ms   0.01ms  4347627064  cap 0  CLIENT  start RETRIEVE
  0.02ms   0.01ms  4388396368  cap 0  HANDLER start RETRIEVE
 41.46ms   0.00ms  4388421328  cap 0  HANDLER stop  RETRIEVE
  0.00ms   0.00ms  4429885128  cap 0  CLIENT  stop  RETRIEVE
```
versus
```
 40.81ms   0.01ms  4141594446  cap 0  CLIENT  start RETRIEVE
  0.05ms   0.01ms  4182405831  cap 0  HANDLER start RETRIEVE
  1.03ms   0.00ms  4182452101  cap 0  HANDLER stop  RETRIEVE
  0.00ms   0.00ms  4183479183  cap 0  CLIENT  stop  RETRIEVE
```
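For reference, user events like these `CLIENT`/`HANDLER` lines end up in the eventlog through GHC's user tracing facilities. A minimal sketch of how such markers can be emitted, assuming `Debug.Trace.traceEventIO` (the `withMarker` helper below is hypothetical, not the actual `grapesy-kvstore` code):

```haskell
import Debug.Trace (traceEventIO)

-- Hypothetical helper: bracket an action with start/stop eventlog markers,
-- so that ghc-events-util can compute the delta between the two events.
withMarker :: String -> String -> IO a -> IO a
withMarker side op act = do
  traceEventIO (side ++ " start " ++ op)
  result <- act
  traceEventIO (side ++ " stop " ++ op)
  return result

-- Run with +RTS -l to actually write the eventlog.
main :: IO ()
main = withMarker "CLIENT" "DELETE" (return ())
```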
Digging a little deeper, we enabled the `trace-foreign-calls` plugin (using `cabal.project.plugin` in the `grapesy` repo) and zoomed in on one of the slow runs:

```
cabal run ghc-events-util -- show-delta grapesy-kvstore.eventlog \
  --show-from 3193512347 --show-until 3234416211
```

Unfortunately this gives us no useful information; the only event that takes any time is the STM retry:
```
 40.51ms  40.51ms  3193795335  cap 0  stopping thread 272 "grapesy:handler" (blocked in STM retry)
```
This means that if we are in fact blocked on some kind of foreign call, we cannot see it. Unfortunately, this may very well be because we currently cannot use the `trace-foreign-calls` plugin on any of the GHC boot libraries. That may be something we have to address in order to find the problem.
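To make that `blocked in STM retry` state concrete: a thread ends up there when an STM transaction calls `retry`, and the RTS parks it until one of the `TVar`s the transaction read changes. A minimal standalone sketch (not `grapesy` code) of a thread blocking the way the handler thread appears to:

```haskell
import Control.Concurrent (forkIO, threadDelay)
import Control.Concurrent.STM

main :: IO ()
main = do
  slot <- newTVarIO Nothing
  _ <- forkIO $ do
    threadDelay 40000  -- simulate a ~40ms delay before the "message" arrives
    atomically $ writeTVar slot (Just "DELETE")
  -- This transaction blocks in STM retry until the producer writes the TVar;
  -- in the eventlog this shows up as "stopping thread .. (blocked in STM retry)".
  msg <- atomically $ do
    m <- readTVar slot
    case m of
      Nothing -> retry
      Just x  -> return x
  putStrLn ("received " ++ msg)
```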
Running with `perf` (see `kvstore.md` for details)

```
perf record $(find . -name grapesy-kvstore -type f) --no-work-simulation
```

we then, perhaps as expected, basically see a whole lot of nothing in `perf report`:
```
Overhead  Command          Shared Object      Symbol
   2,74%  grapesy-kvstore  grapesy-kvstore    [.] _KVStoreziUtilziRandomGen_zdlzdwio_g51W_entry
   1,18%  ghc_worker       libc.so.6          [.] __memmove_avx_unaligned_erms
   1,10%  ghc_worker       [kernel.kallsyms]  [k] _copy_to_iter
   1,08%  ghc_worker       [kernel.kallsyms]  [k] srso_alias_safe_ret
   1,08%  grapesy-kvstore  grapesy-kvstore    [.] stg_upd_frame_info+0xffffffffffc000d7
```
This is not unexpected at this point, of course, because we already know that we are spending most of our time waiting, and `perf` by default measures on-CPU time, not off-CPU time. My attempts to use `perf` to measure off-CPU time have so far not been successful; perhaps some useful pointers are #1 and #2.
With `strace` we can see a very similar pattern to what we see with `ghc-events-util` (this depends on #163). Running

```
strace -f -o strace.log --relative-timestamps=ms -e write \
  $(find . -name grapesy-kvstore -type f) --no-work-simulation
```

gives output like
```
754292  0.000  write(15, "CLIENT start CREATE\n", 20) = 20
754296  0.045  write(15, "HANDLER start CREATE\n", 21) = 21
754296  0.000  write(15, "HANDLER stop CREATE\n", 21) = 21
754292  0.040  write(15, "CLIENT stop CREATE\n", 19) = 19
```
(Note that `strace` reports the time relative to the previous event, not to the next.)
If we drop that `-e write` filter, we can try to see which events sit in between the `CLIENT start` and the corresponding `HANDLER start`; we see only a few events of significant duration (marked `**`):
```
759865     0.000039 write(15, "CLIENT start DELETE\n", 20) = 20
..
759869     0.000021 futex(0x238bae8, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, FUTEX_BITSET_MATCH_ANY <unfinished ...>
759866  ** 0.008688 <... poll resumed>) = 1 ([{fd=3, revents=POLLIN}])
759866     0.000030 read(3, "\1\0\0\0\0\0\0\0", 8) = 8
759866     0.000040 poll([{fd=4, events=POLLIN}, {fd=3, events=POLLIN}], 2, -1) = 1 ([{fd=3, revents=POLLIN}])
759866  ** 0.009945 read(3, "\1\0\0\0\0\0\0\0", 8) = 8
759866     0.000030 poll([{fd=4, events=POLLIN}, {fd=3, events=POLLIN}], 2, -1) = 1 ([{fd=3, revents=POLLIN}])
759866  ** 0.009970 read(3, "\1\0\0\0\0\0\0\0", 8) = 8
759866     0.000049 poll([{fd=4, events=POLLIN}, {fd=3, events=POLLIN}], 2, -1) = 1 ([{fd=3, revents=POLLIN}])
759866  ** 0.009957 read(3, "\1\0\0\0\0\0\0\0", 8) = 8
759866     0.000040 poll([{fd=4, events=POLLIN}, {fd=3, events=POLLIN}], 2, -1 <unfinished ...>
759867  ** 0.001706 <... epoll_wait resumed>[{events=EPOLLIN, data={u32=16, u64=16}}], 64, -1) = 1
..
759869     0.000071 write(15, "HANDLER start DELETE\n", 21) = 21
```
The five events marked `**` here together take 40.26ms (with 38.56ms just for the first four).
Using a branch of `ghc` modified by Andreas for profiling safe FFI calls, running

```
cabal run grapesy-kvstore -- --no-work-simulation +RTS -p -ps
```

unfortunately does not give us much more useful information:
```
MAIN         MAIN                    <built-in>          99.4   0.5
$wnextBytes  KVStore.Util.RandomGen  <no location info>   0.1  47.1
```
We tried different I/O managers on Linux (@FinleyMcIlwaine can perhaps document the details), but that did not make a difference either. (I thought I remembered that `http2` somewhere in its stack cannot run with the non-threaded runtime, because it depends on `getSystemTimerManager`, which is only available in the threaded runtime, but perhaps I remembered incorrectly and it's only `quic` that needs it.)
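For context on that dependency: `GHC.Event.getSystemTimerManager` hands out the RTS-wide timer manager, which is only available under the threaded RTS. A minimal sketch of using it directly (compile with `-threaded`; this just illustrates the `base` API, it is not `http2` or `quic` code):

```haskell
import GHC.Event (getSystemTimerManager, registerTimeout)
import Control.Concurrent (threadDelay)

main :: IO ()
main = do
  -- Fails at runtime under the non-threaded RTS; works with -threaded.
  mgr <- getSystemTimerManager
  _   <- registerTimeout mgr 100000 (putStrLn "timeout fired")  -- 100ms, in microseconds
  threadDelay 200000  -- keep the program alive long enough for the callback
```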
Even more worrying, we have noticed that a small `http2`-only reproducer simply stops after a while (it also exhibits the slow-down we are experiencing, but only in specific scenarios). Escalating this ticket from `performance` to `bug`.
This is due to not using `TCP_NODELAY`. Fixing this is currently blocked on kazu-yamamoto/network-run#6.
The plan is to make `TCP_NODELAY` optional in `grapesy` (both in `ServerParams` and in `ConnParams`), but default to `True`, since the loss in performance for synchronous communication with the wrong default is far worse than the loss in performance for asynchronous communication with the wrong (opposite!) default. We should document this trade-off in the parameter Haddocks.
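For reference, enabling this on a socket boils down to a single call through the `network` package; a minimal sketch of what setting the option looks like (the actual wiring through `ServerParams`/`ConnParams` in `grapesy` will differ):

```haskell
import Network.Socket (Socket, SocketOption (NoDelay), setSocketOption)

-- Disable Nagle's algorithm (i.e. set TCP_NODELAY) on a connected or accepted
-- socket, so small request/response messages are sent out immediately instead
-- of being held back; the interaction of Nagle with delayed ACKs is what
-- produces stalls like the ~40ms delays observed above.
enableNoDelay :: Socket -> IO ()
enableNoDelay sock = setSocketOption sock NoDelay 1
```

Both the client and the server side need this, hence the plan to expose it in both `ConnParams` and `ServerParams`.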
Strangely, @FinleyMcIlwaine is observing much better numbers on Linux than I am (with `TCP_NODELAY`). It is unclear why at present (the machine is comparable, if anything slightly slower).
The poor performance was due to `TCP_NODELAY` not being set. @FinleyMcIlwaine is seeing the `grapesy` benchmark doing 4.3k RPCs/sec (no work simulation, non-concurrent client), with 7k for Java (no work simulation, non-concurrent); @edsko is seeing 1.8k for `grapesy` and roughly 4k for Java. In both cases the difference is roughly 2x, which feels plausible. We could consider further optimizing `grapesy`, but that seems independent from this ticket.