Latency decreases as throughput increases under rate-limiting
Avidanborisov opened this issue · 2 comments
Using the recently introduced --rate-limit option, I've run a simple benchmark in which the rate limit doubles on each run, to see how latency relates to throughput as the sustained throughput increases.
Here are the results, as well as the command to reproduce them (the server is Redis):
$ for i in 1000 2000 4000 8000 16000 32000 64000 128000; do memtier_benchmark -h 10.20.1.4 --hide-histogram --test-time 30 --threads 1 --clients 50 --rate-limit $((i/50)); done 2>/dev/null | grep -E "(Type|Totals)"
Type Ops/sec Hits/sec Misses/sec Avg. Latency p50 Latency p99 Latency p99.9 Latency KB/sec
Totals 1001.44 0.00 909.79 0.97415 0.75900 8.19100 10.04700 42.50
Totals 2001.06 0.00 1817.78 0.81948 0.71100 6.23900 9.72700 84.94
Totals 4000.67 0.00 3635.77 0.77258 0.62300 6.27100 9.79100 169.73
Totals 7999.25 0.00 7271.14 0.69692 0.63100 2.62300 9.40700 339.32
Totals 15747.45 0.00 14314.50 0.66834 0.63900 1.31100 8.70300 667.98
Totals 31851.22 1.67 28953.36 0.65727 0.64700 1.19100 7.03900 1350.96
Totals 63371.75 6.67 57603.23 0.64743 0.65500 1.15900 4.22300 2688.16
Totals 78407.32 24.70 71254.00 0.63731 0.63900 1.07900 4.25500 3326.53
The results are surprising: all the latency metrics seem to decrease as the throughput increases. My expectation from similar benchmarks is that latency gradually increases with sustained throughput, to the point where it explodes once a bottleneck is reached. Is there an issue with the latency calculation, or is this expected?
Thanks!
@Avidanborisov I've tested your script, with a few changes to ensure that the conditions in each iteration are equal, namely:
- ensure a clean DB at the start of each stage
- use a small key range so that the number of commands issued does not affect performance (i.e. longer/higher ops/sec runs don't grow the key range)
- avoid benchmarking invalid/faster commands (like the misses/sec in your output above) by using only a write command
- also check the internal DB latency
script:
#!/bin/bash
HOST=192.168.1.200
PORT=6379
C=50        # clients per thread
A="perf"    # Redis AUTH password
for rate in 1000 2000 4000 8000 16000 32000 64000 128000; do
    echo "--------------------------------------------------"
    echo "running $rate"
    # start each stage from a clean DB and fresh command stats
    redis-cli --no-auth-warning -a $A -h $HOST flushall >/dev/null
    redis-cli --no-auth-warning -a $A -h $HOST config resetstat >/dev/null
    # SET-only workload on a single key; per-connection rate = total rate / clients
    memtier_benchmark --test-time 60 -s $HOST -p $PORT -c $C -t 1 --rate-limiting $(($rate/$C)) -a $A --json-out-file $rate.json --key-maximum 1 --key-minimum 1 --ratio 1:0 --hide-histogram 2>/dev/null | grep -E "(Type|Totals)"
    # server-side (internal) latency of SET, taken from commandstats
    redis-cli --no-auth-warning -a $A -h $HOST info commandstats | grep "set:" | awk '{split($0,a,","); print a[1],a[3]}'
    echo "--------------------------------------------------"
    sleep 15
done
After running the above on 2 physical nodes with Static High Performance Power Mode (processors run in the maximum power and performance state regardless of the OS power management policy), we get the following results:
| Ops/sec | Average internal command latency (ms) | Average client latency including RTT (ms) | p50 latency including RTT (ms) | p99 latency including RTT (ms) | p99.9 latency including RTT (ms) |
|---|---|---|---|---|---|
| 999.94 | 0.00033 | 0.21428 | 0.199 | 1.271 | 2.239 |
| 1967.02 | 0.00034 | 0.18509 | 0.175 | 0.359 | 2.079 |
| 3999.71 | 0.00034 | 0.17738 | 0.175 | 0.287 | 1.831 |
| 7999.07 | 0.00035 | 0.17585 | 0.175 | 0.287 | 1.495 |
| 16270.22 | 0.00035 | 0.17376 | 0.175 | 0.271 | 0.583 |
| 31996.51 | 0.00034 | 0.17849 | 0.175 | 0.287 | 0.407 |
| 56753.64 | 0.00034 | 0.17604 | 0.175 | 0.279 | 0.391 |
| 57265.4 | 0.00034 | 0.17452 | 0.175 | 0.271 | 0.383 |
Given the results above, I strongly suspect that your system has some kind of power governor that is scaling its CPU frequency. For example, if I enable power saving on the nodes I immediately get different results, with the best performance only at the end.
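For example, on a Linux host the governor and current frequencies can be inspected roughly like this (a minimal sketch, assuming the sysfs cpufreq interface is exposed and the cpupower utility is installed):

# show the active frequency governor on each core
cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor | sort | uniq -c
# print the current core frequencies (run this while the benchmark is active)
grep "cpu MHz" /proc/cpuinfo
# force the performance governor (requires root)
sudo cpupower frequency-set -g performance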
I suggest you check this on your end, and if you can't control the frequency, run the benchmark with the rates in reverse order (start with the highest rate and end with the lowest); you should then see the best results at the end.
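In the script above that would just mean reversing the loop order, e.g.:

for rate in 128000 64000 32000 16000 8000 4000 2000 1000; do
    # ... same body as in the script above ...
done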
For now I'm closing this issue, given there is no evidence of an issue in the code of memtier or Redis :)
@filipecosta90 Thanks for the detailed and clear response. I'll be sure to take a look at your suggestions for more accurate benchmarking of this phenomenon.
However, as far as I can tell from your attached data, the overall throughput-latency relationship follows the same pattern as mine, and it is still unclear to me. My general understanding is that the relation between latency, offered load and sustained load should look like the curve described in [1].
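Roughly speaking (my own sketch of the textbook open-queue model described in [1], not anything measured here): for a single-server queue with mean service time S and offered rate λ, the expected response time is

R(\lambda) = \frac{S}{1 - \lambda S}, \qquad 0 \le \lambda S < 1

so latency should stay close to S at low utilization and blow up as the offered load approaches the service capacity 1/S.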
For instance, benchmarking Nginx on my system with wrk2 yields the following curves (similar to the above):
Whereas Redis (with memtier_benchmark) yields:
The plot from your data looks similar to mine: the general direction of the latency curve as the sustained rate increases is downwards instead of upwards, and it never "explodes" even when the sustained rate falls far short of the requested rate.
I wish to understand the reason for this. In particular, why is the latency worse at very low rates than at high rates, and why does it not go up when the sustained rate is maximal?
I'm far from an expert on the subject, but I have a feeling that the latency calculation does not take coordinated omission into account. My script (and yours) tested the system with a single thread, but I couldn't reproduce the expected theoretical behavior with multiple threads either (and with wrk2 the calculation works as expected even with a single thread). If there's an alternative set of command-line options which yields the expected theoretical behavior, that's OK too.
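To illustrate what I mean (my own sketch of the wrk2/HdrHistogram-style correction, not a claim about how memtier computes latency): if request i was supposed to be sent at its scheduled time t_i = i/λ but was actually sent at s_i ≥ t_i because the client was still blocked, then

L_i^{\text{measured}} = c_i - s_i \qquad \text{vs.} \qquad L_i^{\text{corrected}} = c_i - t_i, \quad t_i = \frac{i}{\lambda}

where c_i is the completion time. Without the correction, requests that get delayed behind a slow response are never charged for that delay, which can keep the reported latencies flat even as the system saturates.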
Thanks again!
[1]: https://perfdynamics.blogspot.com/2010/03/bandwidth-vs-latency-world-is-curved.html