thustorage/Sherman

Sherman: A Write-Optimized Distributed B+Tree Index on Disaggregated Memory

Sherman is a B+Tree on disaggregated memory; it uses one-sided RDMA verbs to perform all index operations. Sherman includes three techniques to boost write performance:

Hierarchical locks that leverage the on-chip memory of RDMA NICs
Coalescing of dependent RDMA commands
A two-level version layout in leaf nodes

For more details, please refer to our paper:

[SIGMOD'22] Sherman: A Write-Optimized Distributed B+Tree Index on Disaggregated Memory. Qing Wang, Youyou Lu, and Jiwu Shu.

Update (2024.10)

Please use Deft for evaluation; it improves on Sherman in both performance and synchronization correctness.

System Requirements

Mellanox ConnectX-5 NICs and above
RDMA Driver: MLNX_OFED_LINUX-4.7-3.2.9.0 (if you use MLNX_OFED_LINUX-5**, you need to modify the code to resolve interface incompatibilities)
NIC Firmware: version 16.26.4012 and above (to support on-chip memory, you can use ibstat to obtain the version)
memcached (to exchange QP information)
cityhash
boost 1.53 (to support boost::coroutines::symmetric_coroutine)

RDMA Network Setup

1. RDMA NIC Selection.

Modify this line according to the RDMA NIC you want to use, where ibv_get_device_name(deviceList[i]) returns the name of the RNIC (e.g., mlx5_0).

Sherman/src/rdma/Resource.cpp

Line 28 in 9bb9508

if (ibv_get_device_name(deviceList[i])[5] == '0') {
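The check above compares only the sixth character of the device name, so it accepts mlx5_0 but would also accept any other name with '0' at that position. The following is a minimal sketch of matching the full name instead. pickDevice and originalCheck are hypothetical helpers, not part of Sherman, and the names are passed in as plain strings here; in the real code they would come from ibv_get_device_name(deviceList[i]).

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper, not part of Sherman: returns the index of the first
// device whose name equals `wanted`, or -1 if no device matches.
// In real code the names would come from ibv_get_device_name(deviceList[i]).
int pickDevice(const std::vector<std::string>& names, const std::string& wanted) {
    for (std::size_t i = 0; i < names.size(); ++i) {
        if (names[i] == wanted) return static_cast<int>(i);
    }
    return -1;
}

// The original check, reproduced for comparison (with a length guard added):
// true when the sixth character of the name is '0', as in "mlx5_0".
bool originalCheck(const std::string& name) {
    return name.size() > 5 && name[5] == '0';
}
```

Matching the whole name makes the intended NIC explicit when a machine has several RNICs, at the cost of having to configure the name somewhere.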

2. Gid Selection.

If you use RoCE, set gidIndex in this line to the index reported by the shell command show_gids (usually 3).

Sherman/include/Rdma.h

Line 60 in c5ee9d8

bool createContext(RdmaContext *context, uint8_t port = 1, int gidIndex = 1,

3. MTU Selection.

If you use RoCE and the MTU of your NIC is not equal to 4200 (check with ifconfig), modify the value of path_mtu in src/rdma/StateTrans.cpp.
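As a rough guide for choosing path_mtu, the sketch below picks the largest RDMA path MTU whose payload still fits in the interface MTU. It is an illustration only: PathMtu mirrors the numeric values of enum ibv_mtu from libibverbs so that the example compiles without RDMA headers, pathMtuFor is a hypothetical helper, and headerBytes is an assumed header budget that you supply yourself.

```cpp
// Mirrors the values of `enum ibv_mtu` in libibverbs so that this sketch
// compiles without the RDMA headers.
enum class PathMtu { k256 = 1, k512 = 2, k1024 = 3, k2048 = 4, k4096 = 5 };

// Hypothetical helper: pick the largest path MTU whose payload, plus
// `headerBytes` of protocol headers, fits in the interface MTU.
// `headerBytes` is an assumption supplied by the caller, not a Sherman constant.
PathMtu pathMtuFor(int ifaceMtu, int headerBytes) {
    const int room = ifaceMtu - headerBytes;
    if (room >= 4096) return PathMtu::k4096;
    if (room >= 2048) return PathMtu::k2048;
    if (room >= 1024) return PathMtu::k1024;
    if (room >= 512) return PathMtu::k512;
    return PathMtu::k256;  // smallest value the enum offers
}

// Payload size in bytes for a given path MTU (256 for k256 up to 4096 for k4096).
int payloadBytes(PathMtu m) { return 128 << static_cast<int>(m); }
```

With an interface MTU of 4200 and a header budget of a few dozen bytes, this selects the 4096-byte setting; with a standard 1500-byte Ethernet MTU it falls back to 1024.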

4. On-Chip Memory Size Selection.

Change the constant kLockChipMemSize in include/Common.h so that it does not exceed the maximum on-chip memory size of your NIC.

Getting Started

cd Sherman
./script/hugepage.sh to request huge pages from OS (use ./script/clear_hugepage.sh to return huge pages)
mkdir build; cd build; cmake ..; make -j
cp ../script/restartMemc.sh .
configure ../memcached.conf, where the first line is the memcached IP and the second line is the memcached port
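To make the memcached.conf layout concrete, here is a minimal sketch of reading the two lines described above (IP first, then port). MemcachedEndpoint and readEndpoint are hypothetical names used for illustration; Sherman's own parsing code may differ.

```cpp
#include <istream>
#include <sstream>
#include <string>

// Hypothetical container for the two values stored in memcached.conf.
struct MemcachedEndpoint {
    std::string ip;
    int port = 0;
};

// Reads the IP from the first line and the port from the second line.
// Returns false if either line is missing or the port is not a number.
bool readEndpoint(std::istream& in, MemcachedEndpoint& out) {
    std::string ipLine, portLine;
    if (!std::getline(in, ipLine) || !std::getline(in, portLine)) return false;

    std::istringstream portStream(portLine);
    if (!(portStream >> out.port)) return false;

    // Extracting through a stream drops surrounding whitespace from the IP.
    std::istringstream ipStream(ipLine);
    return static_cast<bool>(ipStream >> out.ip);
}
```

The function takes any std::istream, so it can be used with std::ifstream for the real file or with std::istringstream in a quick check.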

For each run with kNodeCount servers:

./restartMemc.sh (to initialize memcached server)
In each server, execute ./benchmark kNodeCount kReadRatio kThreadCount

We emulate each server as one compute node and one memory node: In each server, as the compute node, we launch kThreadCount client threads; as the memory node, we launch one memory thread. kReadRatio is the ratio of get operations.
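The sketch below illustrates how the three command-line arguments might be read and how a read ratio can drive the choice between get and put. It assumes that kReadRatio is a percentage between 0 and 100; BenchArgs, parseArgs, and isRead are hypothetical names, not the benchmark's actual code.

```cpp
#include <cstdlib>
#include <random>

// Hypothetical holder for: ./benchmark kNodeCount kReadRatio kThreadCount
struct BenchArgs {
    int nodeCount;
    int readRatio;    // assumed to be a percentage in [0, 100]
    int threadCount;
};

// Parses the three positional arguments; returns false on a wrong argument
// count or on values outside the assumed ranges.
bool parseArgs(int argc, char** argv, BenchArgs& out) {
    if (argc != 4) return false;
    out.nodeCount = std::atoi(argv[1]);
    out.readRatio = std::atoi(argv[2]);
    out.threadCount = std::atoi(argv[3]);
    return out.nodeCount > 0 && out.threadCount > 0 &&
           out.readRatio >= 0 && out.readRatio <= 100;
}

// Decides whether the next operation is a get: draws a number in [0, 99]
// and issues a get when it falls below the read ratio.
bool isRead(std::mt19937& rng, int readRatio) {
    std::uniform_int_distribution<int> percent(0, 99);
    return percent(rng) < readRatio;
}
```

Under this assumption, running ./benchmark 2 50 8 on each of two servers would launch 8 client threads per server with roughly half of the operations being gets.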

In ./test/benchmark.cpp, you can modify kKeySpace and zipfan to generate different workloads. In addition, you can enable the macro USE_CORO to bind kCoroCnt coroutines to each client thread.
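To show how a key space and a skew parameter shape a workload, here is a minimal sketch of a Zipfian key sampler. It is not the generator Sherman uses: ZipfSampler is a hypothetical class, it precomputes a full cumulative distribution (so it suits only small key spaces), and theta stands in for the skew setting.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <random>
#include <vector>

// Hypothetical Zipfian sampler over keys [0, keySpace). Key i is drawn with
// probability proportional to 1 / (i + 1)^theta; theta = 0 gives a uniform
// distribution. Precondition: keySpace > 0.
class ZipfSampler {
public:
    ZipfSampler(std::uint64_t keySpace, double theta) : cdf_(keySpace) {
        double sum = 0.0;
        for (std::uint64_t i = 0; i < keySpace; ++i) {
            sum += 1.0 / std::pow(static_cast<double>(i + 1), theta);
            cdf_[i] = sum;  // unnormalized running total
        }
        for (double& c : cdf_) c /= sum;  // normalize so the last entry is 1
    }

    // Draws one key by binary search over the cumulative distribution.
    std::uint64_t next(std::mt19937_64& rng) const {
        std::uniform_real_distribution<double> unit(0.0, 1.0);
        auto it = std::lower_bound(cdf_.begin(), cdf_.end(), unit(rng));
        if (it == cdf_.end()) --it;  // guard against rounding at the top end
        return static_cast<std::uint64_t>(it - cdf_.begin());
    }

private:
    std::vector<double> cdf_;
};
```

A larger theta concentrates accesses on a few hot keys, which stresses lock contention; a value of 0 spreads accesses evenly across the key space.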

Known bugs

The two-level version layout may cause inconsistency in some concurrent cases. Refer to this SIGMOD'23 paper.

TODO

Re-write delete operations