Memcache-perf

Memcache-perf is a memcached load generator designed for high request rates, good tail-latency measurements, and realistic request stream generation.

Requirements

A C++0x compiler
libevent2 (install its development headers; installing memcached's build dependencies covers the rest)
zeromq

Tested on Ubuntu 14.04, 16.04, and 18.04 on x86-64 and ARMv8.

Building

apt-get install libevent-dev libzmq3-dev
apt-get build-dep memcached
make

Basic Usage

Type './mcperf -h' for a full list of command-line options. At minimum, a server must be specified.

$ ./mcperf -s localhost
#type       avg     std     min      p5     p10     p50     p67     p75     p80     p85     p90     p95     p99    p999   p9999
read       62.2    24.1    54.7    59.1    59.4    61.5    62.3    62.8    63.0    63.3    63.5    63.8    68.8    80.2  1012.5
update      0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0
op_q        1.0     0.0     1.0     1.0     1.0     1.0     1.1     1.1     1.1     1.1     1.1     1.1     1.1     1.1     1.1

Total QPS = 16082.2 (80411 / 5.0s)

Misses = 0 (0.0%)
Skipped TXs = 0 (0.0%)

RX   19861517 bytes :    3.8 MB/s
TX    2894832 bytes :    0.6 MB/s
CPU Usage Stats (avg/min/max): 2.21%,0.38%,4.05%

mcperf reports latency statistics (average, standard deviation, minimum, and various percentiles) for get ("read") and set ("update") requests, as well as the achieved QPS and network goodput. A separate thread also tracks client CPU usage on the master, and a warning is issued if the master's CPU usage exceeds 95%. In that case, add more machines as agents. If the verbose (-v) flag is enabled on an agent, the agent reports its CPU usage as well.
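The QPS and bandwidth summary lines follow from the raw counts printed alongside them. The sketch below reproduces the single-connection figures from the run above. It assumes the MB/s lines use 1 MB = 2^20 bytes (this matches the sample output but is not documented here), and its percentile helper uses the nearest-rank definition, which may differ from how mcperf computes percentiles internally. All function names are hypothetical.

```python
import math


def qps(requests: int, seconds: float) -> float:
    """Achieved queries per second: completed requests over the measurement window."""
    return requests / seconds


def goodput_mb_per_s(nbytes: int, seconds: float) -> float:
    """Bytes per second in MB/s, ASSUMING 1 MB = 2**20 bytes."""
    return nbytes / seconds / 2**20


def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile for 0 < p <= 100 (one common definition, not necessarily mcperf's)."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Counts taken from the single-connection run above.
print(f"{qps(80411, 5.0):.1f}")                  # 16082.2
print(f"{goodput_mb_per_s(19861517, 5.0):.1f}")  # 3.8 (RX)
print(f"{goodput_mb_per_s(2894832, 5.0):.1f}")   # 0.6 (TX)
```

The same arithmetic applied to the multi-threaded run gives roughly 75.1 MB/s RX from 393609073 bytes over 5.0 s.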

To achieve a high request rate, you must configure mcperf to use multiple threads, multiple connections, connection pipelining, or remote agents.

$ ./mcperf -s zephyr2-10g -T 24 -c 8
#type       avg     min     1st     5th    10th    90th    95th    99th
read      598.8    86.0   437.2   466.6   482.6   977.0  1075.8  1170.6
update      0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0
op_q        1.5     1.0     1.0     1.1     1.1     1.9     1.9     2.0

Total QPS = 318710.8 (1593559 / 5.0s)

Misses = 0 (0.0%)

RX  393609073 bytes :   75.1 MB/s
TX   57374136 bytes :   10.9 MB/s
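The effect of adding connections can be estimated with Little's law: throughput roughly equals the number of requests in flight divided by the mean latency. The sketch below assumes latencies are printed in microseconds and that each connection has at most one request in flight (no pipelining); under those assumptions its estimates land close to the QPS reported by both runs above. The helper name is hypothetical.

```python
def littles_law_qps(outstanding: int, mean_latency_us: float) -> float:
    """Little's law (L = lambda * W) solved for throughput: lambda = L / W.

    ASSUMES mean_latency_us is in microseconds.
    """
    return outstanding / (mean_latency_us * 1e-6)


# One connection, one request in flight, 62.2 us average read latency.
print(round(littles_law_qps(1, 62.2)))        # 16077 (mcperf reported 16082.2)

# -T 24 -c 8: 24 threads x 8 connections = 192 requests in flight, 598.8 us average.
print(round(littles_law_qps(24 * 8, 598.8)))  # 320641 (mcperf reported 318710.8)
```

This is only a consistency check: once the server saturates, latency grows with load and adding connections stops raising throughput.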

Suggested Usage

Real deployments of memcached often handle the requests of dozens, hundreds, or thousands of front-end clients simultaneously. However, by default, mcperf establishes one connection per server and meters requests one at a time (it waits for a reply before sending the next request). This artificially limits throughput (i.e., queries per second), because the round-trip network latency is almost certainly far longer than the time the memcached server takes to process one request.

To get reasonable benchmark results, mcperf must be configured to portray a realistic client workload more accurately. In general, this means ensuring that (1) there is a large number of client connections, (2) there is the potential for a large number of outstanding requests, and (3) the memcached server saturates and experiences queuing delay well before mcperf does. I suggest the following guidelines:

Establish more than 50 connections per memcached server thread.
Don't use more than about 10 connections per mcperf thread.
Use multiple mcperf agents in order to achieve (1) and (2).
Do not use more mcperf threads than hardware cores/threads.

Here's an example:

memcached_server$ memcached -t 4 -c 32768
agent1$ mcperf -T 16 -A
agent2$ mcperf -T 16 -A
agent3$ mcperf -T 16 -A
agent4$ mcperf -T 16 -A
agent5$ mcperf -T 16 -A
agent6$ mcperf -T 16 -A
agent7$ mcperf -T 16 -A
agent8$ mcperf -T 16 -A
master$ mcperf -s memcached_server --loadonly
master$ mcperf -s memcached_server --noload \
    -B -T 16 -Q 1000 -D 4 -C 4 \
    -a agent1 -a agent2 -a agent3 -a agent4 \
    -a agent5 -a agent6 -a agent7 -a agent8 \
    -c 4 -q 200000

This creates 8 agents * 16 threads * 4 connections = 512 connections in total, or about 128 per memcached server thread. That ought to provide enough outstanding requests to cause server-side queuing delay, with little chance of client-side queuing delay adulterating the latency measurements.
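The sizing guidelines above reduce to simple arithmetic, so they can be checked before a run. The sketch below encodes them for a single memcached server and applies them to the example configuration. LoadConfig, total_connections, and check_guidelines are hypothetical names, the value of 16 hardware threads per agent is an assumption (the example does not state agent core counts), and condition (3), server saturation before mcperf saturates, cannot be verified this way.

```python
from dataclasses import dataclass


@dataclass
class LoadConfig:
    agents: int             # machines running "mcperf -A"
    threads_per_agent: int  # -T on each agent
    conns_per_thread: int   # -c on the master command line (one memcached server assumed)
    server_threads: int     # memcached -t
    cores_per_agent: int    # hardware threads per agent machine (ASSUMED input)


def total_connections(cfg: LoadConfig) -> int:
    return cfg.agents * cfg.threads_per_agent * cfg.conns_per_thread


def check_guidelines(cfg: LoadConfig) -> list[str]:
    """Return the sizing guidelines that cfg violates; an empty list means none."""
    problems = []
    per_server_thread = total_connections(cfg) / cfg.server_threads
    if per_server_thread <= 50:
        problems.append(f"{per_server_thread:g} connections per memcached thread (want > 50)")
    if cfg.conns_per_thread > 10:
        problems.append(f"{cfg.conns_per_thread} connections per mcperf thread (want <= ~10)")
    if cfg.threads_per_agent > cfg.cores_per_agent:
        problems.append(
            f"{cfg.threads_per_agent} mcperf threads on {cfg.cores_per_agent} hardware threads"
        )
    return problems


# The eight-agent example above, ASSUMING 16 hardware threads per agent.
example = LoadConfig(agents=8, threads_per_agent=16, conns_per_thread=4,
                     server_threads=4, cores_per_agent=16)
print(total_connections(example))                           # 512
print(total_connections(example) / example.server_threads)  # 128.0
print(check_guidelines(example))                            # []
```

A single-machine run with one thread and one connection against the same four-thread server fails the first check, which is the under-provisioned situation described earlier.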

Command-line Options