well-typed/grapesy

Poor performance in `kvstore` benchmark on Linux

Closed this issue · 15 comments

This is a complicated problem; it happens only on Linux (not on OSX), and the usual (time) profiling don't reveal any obvious culprits. On this ticket we will collect some of the measurements, but the basic demonstration of the problem is that

cabal run grapesy-kvstore

reports 11.483 RPCs/sec, which should instead be as close to 25 RPCs/sec as possible (25 is the theoretical max, Java does 24.2), and

cabal run grapesy-kvstore -- --no-work-simulation

reports a measly 13.9 RPCs/sec, versus 4700 RPCs/sec for the Java benchmark (so like 2.5 orders of magnitude more). Again, on OSX this does not happen (grapesy is in fact roughly 30% faster than the Java benchmark there); we are getting unexpected delays on Linux somewhere.

Measuring performance with --no-work-simulation (thereby essentially measuring the raw performance) versus ghc version:

GHC version Performance
8.10.7 13.900 RPCs/sec
9.2.8 13.867 RPCs/sec
9.4.8 13.683 RPCs/sec
9.6.4 14.083 RPCs/sec
9.8.2 13.783 RPCs/sec
9.10.1 13.817 RPCs/sec

These tests were run on a machine with other stuff running also, so we need to take these results with a grain of salt, but the conclusion seems safe that the problem exists for all recent versions of ghc (the result should be thousands of RPCs/sec, not tens).

Running in profiling mode

cabal run grapesy-kvstore -- --no-work-simulation +RTS -p -s 

we see

  INIT    time    0.001s  (  0.001s elapsed)
  MUT     time    1.426s  ( 60.981s elapsed)
  GC      time    0.081s  (  0.081s elapsed)
  RP      time    0.000s  (  0.000s elapsed)
  PROF    time    0.000s  (  0.000s elapsed)
  EXIT    time    0.000s  (  0.000s elapsed)
  Total   time    1.508s  ( 61.063s elapsed)

i.e., we're barely doing anything of interest; the time profile shows that the time that we do spend is spent in reasonable places

$wthrowSocketErrorWaitRead              Network.Socket.Internal           <no location info>                                19.0    1.3
$wthrowSocketErrorWaitWrite             Network.Socket.Internal           <no location info>                                12.2    0.4
MAIN                                    MAIN                              <built-in>                                        11.0    0.5
$wnextBytes                             KVStore.Util.RandomGen            <no location info>                                 7.6   47.4
$wframeSender                           Network.HTTP2.H2.Sender           Network/HTTP2/H2/Sender.hs:74:1-11                 5.3    2.6
$wframeReceiver                         Network.HTTP2.H2.Receiver         Network/HTTP2/H2/Receiver.hs:53:1-13               4.2    2.7
parseRequestHeaders                     Network.GRPC.Spec.Request         src/Network/GRPC/Spec/Request.hs:252:1-19          3.0    0.4

but again, this is out of a total time budget that amounts of basically nothing at all.

Speedscope also confirms we're spending most of our time doing nothing:

image

We can of course zoom in on that tiny little sliver on the right-hand side, which then looks more reasonable

image

but no matter what we improve there, with the system waiting for much, it's not going do help very much at all.

Threadscope reveals long periods of waiting, interrupted by small bursts of activity:

image

The small bursts of activity end on "blocked in STM retry"; that probably just means we're waiting to be sent the message.

When we analyse this with ghc-events-util

cabal run ghc-events-util -- show-delta grapesy-kvstore.eventlog --match 'CLIENT|HANDLER'

the typical pattern we see is

  40.90ms     0.01ms    3193512347  cap 0  CLIENT start DELETE
   0.00ms     0.00ms    3234416211  cap 0  HANDLER start DELETE
  41.23ms     0.00ms    3234418221  cap 0  HANDLER stop  DELETE
   0.01ms     0.01ms    3275643492  cap 0  CLIENT stop DELETE

where it takes ~40ms for the message to make it from the client to the handler, then the handler spending no time at all to process it, again taking ~40ms for the message to make it back from the handler to the client, and then the client spending no time at all dealing with the response. So something is going wrong with sending messages (which also explains the blocked in STM retry that we saw in threadscope at the end of each burst of activity). Occassionally however we see something like

   0.26ms     0.01ms    3275648912  cap 0  CLIENT start DELETE
   0.00ms     0.00ms    3275910900  cap 0  HANDLER start DELETE
  40.57ms     0.00ms    3275911980  cap 0  HANDLER stop  DELETE
   0.00ms     0.00ms    3316486546  cap 0  CLIENT stop DELETE

where for whatever reason we don't get that 40ms delay. (Again, none of these delays are happening on OSX.)

We get that same kind of "usually a ~40ms delay, but not quite always" on hte handler side also

  40.77ms     0.01ms    4347627064  cap 0  CLIENT start RETRIEVE
   0.02ms     0.01ms    4388396368  cap 0  HANDLER start RETRIEVE
  41.46ms     0.00ms    4388421328  cap 0  HANDLER stop  RETRIEVE
   0.00ms     0.00ms    4429885128  cap 0  CLIENT stop RETRIEVE

versus

  40.81ms     0.01ms    4141594446  cap 0  CLIENT start RETRIEVE
   0.05ms     0.01ms    4182405831  cap 0  HANDLER start RETRIEVE
   1.03ms     0.00ms    4182452101  cap 0  HANDLER stop  RETRIEVE
   0.00ms     0.00ms    4183479183  cap 0  CLIENT stop RETRIEVE

Digging a little deeper, enabling the trace-foreign-calls plugin using cabal.project.plugin in the grapesy repo

cabal run ghc-events-util -- show-delta grapesy-kvstore.eventlog \
  --show-from 3193512347 --show-until 3234416211 

If we take a look at one of these slow runs, we unfortunately get no useful information; the only event that takes any time is the STM retry:

  40.51ms    40.51ms    3193795335  cap 0  stopping thread 272 "grapesy:handler" (blocked in STM retry)

This means that if we are in fact blocked on some kind of foreign call, we cannot see it. Unfortunately, this may very well be because we currently cannot use the trace-foreign-calls on any of the ghc boot libraries. This may be something we have to address to find the problem.

Running with perf (see kvstore.md for details)

perf record $(find . -name grapesy-kvstore -type f) --no-work-simulation

then, perhaps as expected, we basically see a whole lot of nothing in perf report:

Overhead  Command          Shared Object         Symbol
   2,74%  grapesy-kvstore  grapesy-kvstore       [.] _KVStoreziUtilziRandomGen_zdlzdwio_g51W_entry
   1,18%  ghc_worker       libc.so.6             [.] __memmove_avx_unaligned_erms
   1,10%  ghc_worker       [kernel.kallsyms]     [k] _copy_to_iter
   1,08%  ghc_worker       [kernel.kallsyms]     [k] srso_alias_safe_ret
   1,08%  grapesy-kvstore  grapesy-kvstore       [.] stg_upd_frame_info+0xffffffffffc000d7

This is not unexpected at this point of course because we already know that we spending most of our time waiting, and perf by default is measuring on-cpu time, not off-cpu time. My attempts to use perf to measure off-CPU time have so far not yet been successful; perhaps some useful pointers are #1 and #2.

With strace, we can see a very similar pattern to what we see with ghc-events-util (this depends on #163). Running

strace -f -o strace.log --relative-timestamps=ms -e write \
 $(find . -name grapesy-kvstore -type f) --no-work-simulation

gives stuff like

754292      0.000 write(15, "CLIENT start CREATE\n", 20) = 20
754296      0.045 write(15, "HANDLER start CREATE\n", 21) = 21
754296      0.000 write(15, "HANDLER stop  CREATE\n", 21) = 21
754292      0.040 write(15, "CLIENT stop CREATE\n", 19) = 19

(note that strace gives the relative time to the previous event, not the next).

If we drop that -e write filter, see can try to see which events sit in between the CLIENT start CREATE and the HANDLER start CREATE, we see only a few events of significant time (marked **):

759865      0.000039 write(15, "CLIENT start DELETE\n", 20) = 20
..
759869      0.000021 futex(0x238bae8, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, FUTEX_BITSET_MATCH_ANY <unfinished ...>
759866 **   0.008688 <... poll resumed>) = 1 ([{fd=3, revents=POLLIN}])
759866      0.000030 read(3, "\1\0\0\0\0\0\0\0", 8) = 8
759866      0.000040 poll([{fd=4, events=POLLIN}, {fd=3, events=POLLIN}], 2, -1) = 1 ([{fd=3, revents=POLLIN}])
759866 **   0.009945 read(3, "\1\0\0\0\0\0\0\0", 8) = 8
759866      0.000030 poll([{fd=4, events=POLLIN}, {fd=3, events=POLLIN}], 2, -1) = 1 ([{fd=3, revents=POLLIN}])
759866 **   0.009970 read(3, "\1\0\0\0\0\0\0\0", 8) = 8
759866      0.000049 poll([{fd=4, events=POLLIN}, {fd=3, events=POLLIN}], 2, -1) = 1 ([{fd=3, revents=POLLIN}])
759866 **   0.009957 read(3, "\1\0\0\0\0\0\0\0", 8) = 8
759866      0.000040 poll([{fd=4, events=POLLIN}, {fd=3, events=POLLIN}], 2, -1 <unfinished ...>
759867 **   0.001706 <... epoll_wait resumed>[{events=EPOLLIN, data={u32=16, u64=16}}], 64, -1) = 1
..
759869      0.000071 write(15, "HANDLER start DELETE\n", 21) = 21

the four longer-lasting events here together take 40.26ms (with 38.56 just for the first four).

Using a branch of ghc modified by Andreas for profiling safe FFI calls, running

cabal run grapesy-kvstore -- --no-work-simulation +RTS -p -ps 

unfortunately does not give us much more useful information

MAIN            MAIN                              <built-in>                                        99.4    0.5
$wnextBytes     KVStore.Util.RandomGen            <no location info>                                 0.1   47.1

We tried with different I/O managers on Linux (@FinleyMcIlwaine can perhaps document the details), but that did not make a difference either. (I thought I remembered that http2 somewhere in its stack cannot run with the non-threaded runtime, because it depends on getSystemTimerManager which is only available in the threaded runtime, but perhaps I remembered incorrectly, and it's only quic that needs it.)

Even more worrying, we have noticed that a small http2-only reproducer simply just stops after a while (it also exhibits the slow-down we are experiencing, but only in specific scenarios). Escalating this ticket from performance to bug.

This is due to not using TCP_NODELAY. Fixing this is currently blocked on kazu-yamamoto/network-run#6 .

The plan is to make TCP_NODELAY optional in grapesy (both in ServerParams and in ConnParams), but default to True, since the loss in performance for synchronous comms with the wrong default is way worse than the loss in performance for asynchronous comms with the wrong (opposite!) default. We should document the trade-off in the parameter Haddocks.

Strangely, @FinleyMcIlwaine is observing much better numbers on Linux than I am (with TCP_NODELAY). Unclear why at present (on a comparable machine, if anything slightly slower).

The poor performance was due to TCP_NODELAY. @FinleyMcIlwaine is seeing grapesy benchmark doing 4.3k RPCs/sec (no work simulation, non-concurrent client), with 7k for the Java (no work simulation, non-concurrent); @edsko is seeing 1.8k for grapesy, 4k ish for Java. In both cases "roughly" 2x in the difference, feels plausible. We could consider further optimizing grapesy but seems independent from this ticket.