Mutilate

Mutilate is a memcached load generator designed for high request rates, good tail-latency measurements, and realistic request stream generation.

Requirements

A C++0x compiler
scons
libevent
gengetopt
zeromq (optional)

Mutilate has only been thoroughly tested on Ubuntu 11.10. We'll flesh out compatibility over time.

Building

apt-get install scons libevent-dev gengetopt libzmq-dev
scons

Basic Usage

Type './mutilate -h' for a full list of command-line options. At minimum, a server must be specified.

$ ./mutilate -s localhost
#type       avg     min     1st     5th    10th    90th    95th    99th
read       52.4    41.0    43.1    45.2    48.1    55.8    56.6    71.5
update      0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0
op_q        1.5     1.0     1.0     1.1     1.1     1.9     2.0     2.0

Total QPS = 18416.6 (92083 / 5.0s)

Misses = 0 (0.0%)

RX   22744501 bytes :    4.3 MB/s
TX    3315024 bytes :    0.6 MB/s

Mutilate reports the latency (average, minimum, and various percentiles) for get and set commands, as well as achieved QPS and network goodput.
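If you want to post-process the report, the layout shown above is regular enough to parse with a few lines of Python. This is a minimal sketch, not part of mutilate: the function name parse_report and the SAMPLE constant are hypothetical, and it assumes the output keeps the exact layout of the example above (a "#type" header row, whitespace-separated columns, and a "Total QPS =" line).

```python
import re

# Sample copied from the example run above.
SAMPLE = """\
#type       avg     min     1st     5th    10th    90th    95th    99th
read       52.4    41.0    43.1    45.2    48.1    55.8    56.6    71.5
update      0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0
op_q        1.5     1.0     1.0     1.1     1.1     1.9     2.0     2.0

Total QPS = 18416.6 (92083 / 5.0s)
"""


def parse_report(text):
    """Return (rows, qps) parsed from a mutilate-style report.

    rows maps a row name such as "read" to a dict of column -> float.
    qps is the number after "Total QPS =", or None if that line is absent.
    """
    columns = []
    rows = {}
    qps = None
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "#type":
            # Header row: remember the column names (avg, min, 1st, ...).
            columns = fields[1:]
        elif line.startswith("Total QPS"):
            match = re.search(r"=\s*([\d.]+)", line)
            if match:
                qps = float(match.group(1))
        elif columns and len(fields) == len(columns) + 1:
            # Data row: a name followed by one number per column.
            try:
                values = [float(v) for v in fields[1:]]
            except ValueError:
                continue  # Not a numeric row; skip it.
            rows[fields[0]] = dict(zip(columns, values))
    return rows, qps


rows, qps = parse_report(SAMPLE)
print(rows["read"]["99th"], qps)
```

Keying the rows on the header line rather than on fixed column positions means the sketch keeps working if the set of percentiles changes, as long as the header row is still present.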

To achieve a high request rate, you must configure mutilate to use multiple threads, multiple connections, connection pipelining, or remote agents.

$ ./mutilate -s zephyr2-10g -T 24 -c 8
#type       avg     min     1st     5th    10th    90th    95th    99th
read      598.8    86.0   437.2   466.6   482.6   977.0  1075.8  1170.6
update      0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0
op_q        1.5     1.0     1.0     1.1     1.1     1.9     1.9     2.0

Total QPS = 318710.8 (1593559 / 5.0s)

Misses = 0 (0.0%)

RX  393609073 bytes :   75.1 MB/s
TX   57374136 bytes :   10.9 MB/s

Suggested Usage

Real deployments of memcached often handle the requests of dozens, hundreds, or thousands of front-end clients simultaneously. However, by default, mutilate establishes one connection per server and meters requests one at a time (it waits for a reply before sending the next request). This artificially limits throughput (i.e. queries per second), as the round-trip network latency is almost certainly far longer than the time it takes for the memcached server to process one request.
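The throughput ceiling described above can be estimated with a one-line calculation: when each connection waits for a reply before sending the next request, at most connections * depth requests are in flight, so the request rate cannot exceed that number divided by the round-trip time. The sketch below is illustrative only. The function name max_qps is hypothetical, and it assumes the latency column in the basic-usage example is in microseconds, which this README does not state.

```python
def max_qps(connections, depth, round_trip_seconds):
    """Upper bound on request rate for a closed-loop load generator.

    connections: number of open connections
    depth: maximum outstanding requests per connection
    round_trip_seconds: average time from sending a request to
        receiving its reply
    """
    if round_trip_seconds <= 0:
        raise ValueError("round_trip_seconds must be positive")
    return connections * depth / round_trip_seconds


# One connection, one outstanding request, and the 52.4 average read
# latency from the basic-usage example, ASSUMED to be microseconds.
single = max_qps(connections=1, depth=1, round_trip_seconds=52.4e-6)
print(round(single))
```

Under that assumption the bound comes out at roughly 19,000 requests per second, which is in line with the 18416.6 QPS reported in the basic-usage example. Raising either the connection count or the pipelining depth raises the ceiling proportionally.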

To get reasonable benchmark results, configure mutilate to portray a realistic client workload more accurately. In general, this means ensuring that (1) there are a large number of client connections, (2) there is the potential for a large number of outstanding requests, and (3) the memcached server saturates and experiences queuing delay long before mutilate does. I suggest the following guidelines:

Establish on the order of 100 connections per memcached server thread.
Don't exceed about 16 connections per mutilate thread.
Use multiple mutilate agents in order to achieve (1) and (2).
Do not use more mutilate threads than hardware cores/threads.
Use -Q to configure the "master" agent to take latency samples at a slow, constant rate.

Here's an example:

memcached_server$ memcached -t 4 -c 32768
agent1$ mutilate -T 16 -A
agent2$ mutilate -T 16 -A
agent3$ mutilate -T 16 -A
agent4$ mutilate -T 16 -A
agent5$ mutilate -T 16 -A
agent6$ mutilate -T 16 -A
agent7$ mutilate -T 16 -A
agent8$ mutilate -T 16 -A
master$ mutilate -s memcached_server --loadonly
master$ mutilate -s memcached_server --noload \
    -B -T 16 -Q 1000 -D 4 -C 4 \
    -a agent1 -a agent2 -a agent3 -a agent4 \
    -a agent5 -a agent6 -a agent7 -a agent8 \
    -c 4 -q 200000

This will create 8 * 16 * 4 = 512 connections total (8 agents, 16 threads each, 4 connections per thread), which is about 128 per memcached server thread. This ought to be enough outstanding requests to cause server-side queuing delay, with no possibility of client-side queuing delay adulterating the latency measurements.
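The connection arithmetic for the example above, together with two of the suggested guidelines, can be sketched in Python. This is a planning aid under stated assumptions, not part of mutilate. The function names are hypothetical, the thresholds of 16 and 100 are taken loosely from the guidelines (which say "about" and "on the order of"), and only the agents' connections are counted, matching the total given in the text.

```python
def agent_connections(agents, threads_per_agent, conns_per_thread):
    """Total connections opened by the load-generating agents."""
    return agents * threads_per_agent * conns_per_thread


def check_plan(agents, threads_per_agent, conns_per_thread, server_threads):
    """Return (total, per_server_thread, warnings) for a planned run.

    The thresholds below are illustrative readings of the guidelines,
    not hard limits enforced by mutilate.
    """
    total = agent_connections(agents, threads_per_agent, conns_per_thread)
    per_server_thread = total / server_threads
    warnings = []
    if conns_per_thread > 16:
        warnings.append("more than about 16 connections per mutilate thread")
    if per_server_thread < 100:
        warnings.append("fewer than about 100 connections per memcached thread")
    return total, per_server_thread, warnings


# The example above: 8 agents with -T 16, -c 4, and memcached -t 4.
total, per_server_thread, warnings = check_plan(8, 16, 4, 4)
print(total, per_server_thread, warnings)
```

For the example configuration this gives 512 connections, 128 per memcached thread, and no warnings. Halving the number of agents to 4 would drop the per-thread figure to 64 and trigger the second warning.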

Command-line Options