A framework to understand RDMA performance. This is the source code for our USENIX ATC paper.
- InfiniBand HCAs. RoCE HCAs have not been tested, but should work with minor
  modifications; see the `is_roce()` usage in HERD for details.
- Ubuntu 12.04+ with Mellanox OFED 2.4+. Some benchmarks have also been tested
  on CentOS 7 with OpenFabrics OFED.
- memcached, libmemcached-dev, libmemcached-tools
- libnuma-dev (numactl-devel on CentOS)
All benchmarks require one server machine and multiple client machines. Every benchmark is contained in one directory.
- The number of client machines required is listed in each benchmark's README file. The server waits for all clients to launch, so a benchmark will not make progress until the required number of clients has been started.
- Modify `HRD_REGISTRY_IP` in `run-servers.sh` and `run-machines.sh` to the IP
  address of the server machine. This machine runs a memcached instance that is
  used as a queue pair registry.
- Allocate hugepages on the NIC's socket at the server machine. On our machines,
  the NIC is attached to socket 0, so all benchmark scripts bind allocated memory
  and threads to socket 0. On Ubuntu systems, create 8192 hugepages on socket 0
  using:

      sudo sh -c "echo 8192 > /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages"

The benchmarks used in the paper are described below. This repository contains a few other benchmarks as well.
Benchmark | Description |
---|---|
`herd` | An improved implementation of the HERD key-value cache. |
`mica` | A simplified implementation of MICA. |
`atomics-sequencer` | Sequencer using one-sided fetch-and-add. Also emulates DrTM-KV. |
`ws-sequencer` | Sequencer using HERD RPCs (UC WRITE requests, UD SEND responses). |
`ss-sequencer` | Sequencer using header-only datagram RPCs (i.e., UD SENDs only). |
`rw-tput-sender` | Microbenchmark to measure throughput of outbound READs and WRITEs. |
`rw-tput-receiver` | Microbenchmark to measure throughput of inbound READs and WRITEs. |
`ud-sender` | Microbenchmark to measure throughput of outbound SENDs. |
`ud-receiver` | Microbenchmark to measure throughput of inbound SENDs. |
`rw-allsig` | WQE cache misses for outbound READs and WRITEs. |
The `libhrd` library is used to implement all benchmarks. It consists of
convenience functions for initial RDMA setup, such as creating and connecting
QPs, and allocating hugepage memory.

Distributing QP information (required for connection setup in connected
transports, and for routing in datagram transports) requires a temporary
out-of-band communication channel. To simplify this process, we use a memcached
instance to publish (e.g., `hrd_publish_conn_qp()`) and pull QP information
(e.g., `hrd_get_published_qp()`) using global QP names.
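
As a rough illustration of how such a registry works, the sketch below publishes
and pulls a small QP descriptor through libmemcached under a global QP name.
Only the libmemcached calls are real; the `qp_attr_blob` struct, the connection
helper, and the key contents are hypothetical stand-ins rather than libhrd's
actual definitions.

```c
/*
 * Hypothetical sketch of a memcached-based QP registry (not libhrd's code).
 * A server publishes a small QP descriptor under a global name; clients spin
 * until the descriptor appears.
 */
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <libmemcached/memcached.h>

struct qp_attr_blob {  /* hypothetical published descriptor */
  uint16_t lid;        /* LID of the port the QP was created on */
  uint32_t qpn;        /* queue pair number */
};

static memcached_st *registry_connect(const char *registry_ip) {
  memcached_st *mc = memcached_create(NULL);
  memcached_server_add(mc, registry_ip, 11211);  /* default memcached port */
  return mc;
}

/* Server side: publish this QP's attributes under a global QP name. */
static void publish_qp(memcached_st *mc, const char *qp_name,
                       const struct qp_attr_blob *attr) {
  memcached_return_t rc =
      memcached_set(mc, qp_name, strlen(qp_name), (const char *) attr,
                    sizeof(*attr), (time_t) 0 /* no expiry */, 0 /* flags */);
  assert(rc == MEMCACHED_SUCCESS);
}

/* Client side: block until the named QP has been published, then copy it out. */
static void pull_qp(memcached_st *mc, const char *qp_name,
                    struct qp_attr_blob *attr_out) {
  size_t len = 0;
  uint32_t flags = 0;
  memcached_return_t rc;
  char *val;

  while ((val = memcached_get(mc, qp_name, strlen(qp_name), &len, &flags,
                              &rc)) == NULL) {
    /* Not published yet; retry. */
  }
  assert(rc == MEMCACHED_SUCCESS && len == sizeof(*attr_out));
  memcpy(attr_out, val, sizeof(*attr_out));
  free(val);
}
```

In this sketch, `registry_ip` plays the role of the `HRD_REGISTRY_IP` address
set in the run scripts, i.e., the server machine running memcached.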
The code was written to work on a cluster (CIB) with dual-port NICs whose switch connectivity does not allow cross-port communication. Using both ports in this constrained environment makes the initial QP connection setup slightly complicated. All benchmarks also work on single-port NICs. We usually use the following logic while setting up connections:
- There are `N` client threads in the system, and each client thread uses `Q`
  QPs.
- The server has `num_server_ports` ports starting from port `base_port_index`.
  Similarly, clients have `num_client_ports` ports. The `base_port_index` may be
  different for the server and clients.
- On the CIB cluster, port `i` on a NIC can only communicate with port `i` on
  other NICs. So `base_port_index` must be the same for clients and the server,
  and `num_client_ports == num_server_ports`.
- One server thread (the master thread in case there are worker threads) creates
  `N * Q` QPs on each server port. For applications requiring a request region,
  only one memory region is created and registered with all of the
  `num_server_ports` control blocks. Only some of these QPs actually get used by
  clients.
- Client threads have a global index `clt_i`. Each client thread uses a single
  control block and creates all its QPs on port index `clt_i % num_client_ports`
  (relative to the client's `base_port_index`). It connects all these QPs to QPs
  on server port index `clt_i % num_server_ports` (relative to the server's
  `base_port_index`). This works both on CIB and on the Apt and Intel clusters,
  which support any-to-any communication between ports. A sketch of this port
  mapping follows this list.
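
Below is a minimal sketch of the port-mapping arithmetic described above,
assuming a small configuration struct; the struct and function names are
illustrative and do not appear in libhrd.

```c
/* Illustrative sketch of the client-side port mapping (not libhrd code). */
#include <stdio.h>

struct port_config {
  int num_client_ports;        /* ports used on the client NIC */
  int num_server_ports;        /* ports used on the server NIC */
  int client_base_port_index;  /* first usable port on the client NIC */
  int server_base_port_index;  /* first usable port on the server NIC */
};

/* Port (on the client NIC) on which client thread clt_i creates its QPs. */
static int client_port_index(int clt_i, const struct port_config *cfg) {
  return cfg->client_base_port_index + (clt_i % cfg->num_client_ports);
}

/* Server port to which client thread clt_i connects all of its QPs. */
static int server_port_index(int clt_i, const struct port_config *cfg) {
  return cfg->server_base_port_index + (clt_i % cfg->num_server_ports);
}

int main(void) {
  /* Example: dual-port NICs on both sides with equal base port indices,
   * matching the CIB constraint. */
  struct port_config cfg = {.num_client_ports = 2,
                            .num_server_ports = 2,
                            .client_base_port_index = 0,
                            .server_base_port_index = 0};

  for (int clt_i = 0; clt_i < 4; clt_i++) {
    printf("client thread %d: local port %d -> server port %d\n", clt_i,
           client_port_index(clt_i, &cfg), server_port_index(clt_i, &cfg));
  }
  return 0;
}
```

With equal base port indices and equal port counts on both sides (the CIB
constraint), the two functions return the same index, so each client's QPs stay
on a matching port pair.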
Most benchmarks post one signaled work request per `UNSIG_BATCH` work requests.
This is done to reduce CQE DMAs. With `UNSIG_BATCH = 4`, a sequence of work
requests looks as follows. Note that a work request is not `post()`ed
immediately; it is added to a list and posted when the number of work requests
in the list equals `postlist`.

    wr 0 -> signaled
    wr 1 -> unsignaled
    wr 2 -> unsignaled
    wr 3 -> unsignaled
    Poll for wr 0's completion. A postlist should have ended.
    wr 4 -> signaled
    wr 5 -> unsignaled
    ...
    Poll for wr 4's completion. Another postlist should have ended.

This imposes two requirements:

- Postlist check: `postlist <= UNSIG_BATCH`. We poll for a completion before
  queueing work request `UNSIG_BATCH + 1`. If `postlist > UNSIG_BATCH`, nothing
  will have been posted at this point, so polling will get stuck.
- Queue capacity check: `HRD_Q_DEPTH >= 2 * UNSIG_BATCH`. With the above scheme,
  up to `2 * UNSIG_BATCH - 1` work requests can be un-ACKed by the QP. With a QP
  of size `N`, `N - 1` work requests are allowed to be un-ACKed by the
  InfiniBand/RoCE specification.
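
The sketch below shows roughly how a postlist can be combined with this
selective-signaling scheme using the verbs API; the helper's signature, the
`POSTLIST` and `UNSIG_BATCH` values, and the error handling are illustrative
assumptions rather than the benchmarks' actual code.

```c
/* Illustrative sketch of postlist + selective signaling (not benchmark code). */
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <infiniband/verbs.h>

#define POSTLIST 4     /* WRs chained per ibv_post_send() call */
#define UNSIG_BATCH 4  /* one signaled WR per UNSIG_BATCH WRs */

/*
 * Issue num_wrs RDMA WRITEs on a connected QP, signaling selectively.
 * sgl must have POSTLIST entries; num_wrs is assumed to be a multiple of
 * POSTLIST in this sketch.
 */
static void post_writes(struct ibv_qp *qp, struct ibv_cq *cq,
                        struct ibv_sge *sgl, uint64_t remote_addr,
                        uint32_t rkey, long num_wrs) {
  struct ibv_send_wr wr[POSTLIST], *bad_wr;
  struct ibv_wc wc;
  long nb_tx = 0;  /* total WRs queued so far */

  while (nb_tx < num_wrs) {
    for (int i = 0; i < POSTLIST; i++, nb_tx++) {
      memset(&wr[i], 0, sizeof(wr[i]));
      wr[i].opcode = IBV_WR_RDMA_WRITE;
      wr[i].num_sge = 1;
      wr[i].sg_list = &sgl[i];
      wr[i].wr.rdma.remote_addr = remote_addr;
      wr[i].wr.rdma.rkey = rkey;
      /* Chain the WRs of this postlist together. */
      wr[i].next = (i == POSTLIST - 1) ? NULL : &wr[i + 1];

      if (nb_tx % UNSIG_BATCH == 0) {
        /* First WR of an unsignaled batch: request a completion... */
        wr[i].send_flags = IBV_SEND_SIGNALED;
        /* ...and, except at the very start, absorb the previous batch's CQE
         * before queueing work request UNSIG_BATCH + 1 of that batch. */
        if (nb_tx != 0) {
          int n;
          while ((n = ibv_poll_cq(cq, 1, &wc)) == 0) { /* spin */ }
          assert(n == 1 && wc.status == IBV_WC_SUCCESS);
        }
      } else {
        wr[i].send_flags = 0;  /* unsignaled */
      }
    }
    /* Post the whole chain with a single verb call (the "postlist"). */
    int ret = ibv_post_send(qp, &wr[0], &bad_wr);
    assert(ret == 0);
  }
}
```

In this sketch, the poll inside the loop can only succeed because
`POSTLIST <= UNSIG_BATCH` guarantees that the previous signaled work request has
already been posted, and at most `2 * UNSIG_BATCH - 1` work requests are
outstanding at any time, which is why `HRD_Q_DEPTH >= 2 * UNSIG_BATCH` suffices.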
Anuj Kalia (akalia@cs.cmu.edu)
Copyright 2016, Carnegie Mellon University
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.