
pmemids_bench

pmemids_bench is a C++, YCSB-based benchmark suite that includes seven commonly used indexing data structures, each implemented in four persistent modes and evaluated under four parallel modes (see Table 1 below). You can learn more about this project in our MSST 2020 paper.

Table of Contents

Quick Start
Benchmark Design and Implementation
Benchmarking Results and Analysis
Usage
Contribution
Credit

Relevant publications:

  • Islam, Abdullah Al Raqibul and Dai, Dong. "Understand the overheads of storage data structures on persistent memory," PPoPP '20: Proceedings of the 25th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2020. doi:10.1145/3332466.3374509
  • Islam, Abdullah Al Raqibul and Dai, Dong and Narayanan, Anirudh and York, Christopher. "A Performance Study of Optane Persistent Memory: From Indexing Data Structures’ Perspective," MSST '20: 36th International Conference on Massive Storage Systems and Technology, 2020.

Quick Start

Prerequisites

  • GNU Make
  • a C++ compiler with C++11 support
  • PMDK (we used version 1.8)
  • external libraries: pthreads, pmemobj, vmem, memkind

Build & Run

To build pmemids_bench:

$ make

Run Workload A with a dram-based array implementation in single thread:

$ ./ycsbc -db storeds -threads 1 -dbpath /pmem -type array-dram -P workloads/workloada.spec

You can run the program without any arguments (i.e., ./ycsbc) to print the command-line help.

Note that, unlike the original YCSB, there are no separate load and run commands. Specify how many records to load via the recordcount property; see the properties files in the workloads directory for reference.
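
As a rough illustration, a spec file contains YCSB-style properties like the following. The property names follow YCSB workload-generator conventions and the values here are made up for illustration; they are not the contents of the shipped workloada.spec:

    # illustrative example, not the shipped workloada.spec
    recordcount=1000000
    operationcount=1000000
    readproportion=1
    updateproportion=0
    insertproportion=0
    requestdistribution=uniform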

Benchmark Design and Implementation

Indexing Data Structures

Data Structure   Description                      Persistent Modes   Parallel Modes           Workloads
Arraylist        Fixed items/size                 DRAM               Single thread            A (100% Read)
Linkedlist       Append-only                      PMEM-Volatile      Parallel, Saturated      B (100% Write:
Hashtable        Chained linked list              PMEM-Persist       Concurrent, Contention     90% Update,
Skiplist         Fixed height, fair probability   PMEM-Trans         NUMA                       10% Insert)
B-Tree           Values referenced/embedded
B+-Tree          Values in the leaf nodes only
RB-Tree          Instant rebalancing

Table 1: Summary of the indexing data structures and their running modes under various workloads in the benchmark suite. The persistent modes, parallel modes, and workloads apply to every data structure.
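
To make the persistent modes concrete, the sketch below contrasts how a single field update becomes durable in PMEM-Persist (explicit flush) versus PMEM-Trans (failure-atomic transaction) using libpmemobj. This is a minimal illustration under assumed names (the pool path, layout name, and entry struct are placeholders), not code from the repository; DRAM mode would simply use the normal heap, and PMEM-Volatile allocates from PMEM via libvmem/memkind without any persistence calls.

    #include <libpmemobj.h>
    #include <cstdint>
    #include <cstdio>

    struct entry { uint64_t key; uint64_t value; };

    int main() {
        // Placeholder pool path and layout name.
        PMEMobjpool *pop = pmemobj_create("/pmem/example.pool", "example",
                                          PMEMOBJ_MIN_POOL, 0666);
        if (pop == NULL) { perror("pmemobj_create"); return 1; }

        PMEMoid oid;
        if (pmemobj_zalloc(pop, &oid, sizeof(struct entry), 0)) return 1;
        struct entry *e = (struct entry *)pmemobj_direct(oid);

        // PMEM-Persist: store in place, then flush the cache lines.
        // Durable, but a crash mid-update can leave a torn entry.
        e->key = 42; e->value = 7;
        pmemobj_persist(pop, e, sizeof(*e));

        // PMEM-Trans: the same update wrapped in a PMDK transaction.
        // Undo logging makes it failure-atomic, at extra cost.
        TX_BEGIN(pop) {
            pmemobj_tx_add_range(oid, 0, sizeof(struct entry));
            e->key = 43; e->value = 8;
        } TX_END

        pmemobj_close(pop);
        return 0;
    }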

Workloads

To help benchmark storage systems, typical YCSB setups include representative workloads such as the five workloads (A to E) shown in the top part of Table 2.

However, this is not feasible in pmemids_bench, simply because, unlike benchmarking storage systems, we do not know how developers will use the indexing data structures. So, in our implementation, we leverage the YCSB workload generator but define different workloads for our evaluations, shown in the bottom part of Table 2.

Only two workloads are included in pmemids_bench by default, but developers can configure their own; several samples are provided in the workloads directory (check the files with the .spec extension). We also preserved the original YCSB workloads; you can check them here.

YCSB Workload    A     B     C     D     E
Read             50    95    100   95    50
Update           50    5     -     -     -
Insert           -     -     -     5     -
Read & Update    -     -     -     -     50

pmemids_bench Workload   A (100% Read)   B (100% Write)
Read                     100             -
Update                   -               90
Insert                   -               10

Table 2: Workloads in % of different operations.
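
For instance, the pmemids_bench Workload B mix above (90% update, 10% insert) would be expressed in a spec file roughly as follows (property names follow YCSB conventions; treat this as an illustration rather than the shipped workloadb.spec):

    readproportion=0
    updateproportion=0.9
    insertproportion=0.1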

Benchmarking Results and Analysis

Evaluation Platform

We tested our benchmark suite on a Dell R740 rack server with two sockets. Each socket holds a 2nd-generation Intel Xeon Scalable processor (Gold 6254 @ 3.10 GHz) with 18 physical (36 virtual) cores. The machine runs Ubuntu 18.04 with Linux kernel 4.15.0. To enable the NUMA evaluation, we put all the DRAM and Optane DC DIMMs into one socket (node 0). This socket has 6 DRAM DIMMs of 32GB each and 1 Optane DC DIMM of 128GB. In all the evaluations except the NUMA ones, we bind all the threads to node 0 to ensure local memory accesses. The total memory capacity is 192GB DRAM and 128GB Optane DC. We used PMDK 1.8 in the implementation.
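
For example, a local-access run can be pinned to node 0 with numactl (an illustrative command, not necessarily the exact script we used):

$ numactl --cpunodebind=0 --membind=0 ./ycsbc -db storeds -threads 1 -dbpath /pmem -type array-dram -P workloads/workloada.spec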

Benchmarking Results and Observations

Read/Write performance

[Figures: single_thread_100%_write, single_thread_100%_read]

Fig. 3. The benchmark results of seven indexing data structures in the single-thread case. The x-axis lists all the persistent modes and the y-axis shows the throughput (KTPS, i.e. thousands of transactions per second) after running the workloads. We tested all five workloads and show two of them (100% read and 100% write) here. Note that all the experiments were repeated 10 times; we plot both the average and the variation in the figure.

Transaction overhead

[Figure: btree_version_compare]

Fig. 5. The performance of two B-Tree implementations (value referenced vs. value embedded) in different persistent modes. The tree node size varies from 256 bytes to 2 KB.
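
The difference between the two variants comes down to node layout. A simplified sketch follows; the field names and sizes are illustrative, not the repository's actual definitions:

    #define FANOUT 8   /* illustrative; real node sizes range 256 B to 2 KB */

    /* Value-referenced: the node stores pointers, so each update also
       touches (and, on PMEM, flushes) a separately allocated value. */
    struct btree_node_ref {
        uint64_t keys[FANOUT];
        char    *values[FANOUT];            /* out-of-node values */
        struct btree_node_ref *children[FANOUT + 1];
    };

    /* Value-embedded: values live inside the node, so an update flushes
       only the node itself, at the cost of larger nodes. */
    struct btree_node_emb {
        uint64_t keys[FANOUT];
        char     values[FANOUT][32];        /* fixed-size in-node values */
        struct btree_node_emb *children[FANOUT + 1];
    };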

Scalability performance

[Figures: parallel_thread_100%_write, parallel_thread_100%_read]

Fig. 6. The benchmark results of seven indexing data structures in the Parallel, Saturated mode. The x-axis is the number of threads; the y-axis is the aggregated throughput (KTPS) across all threads. We used up to 16 threads to make sure each thread runs on a physical core. We show both 100% write and 100% read workloads here.

Concurrent/Contention performance

[Figures: concurrent_thread_1, concurrent_thread_16]
Fig. 7. The 100% write performance comparison of two cases running in the Concurrent, Contention mode. Here, the x-axis groups the indexing data structures, and each group contains results for the different persistent modes. The y-axis shows the throughput in KTPS.

Lock persistence overhead

[Figure: rbtree_lock_persistence]

Fig. 8. The 100% write performance comparison of four different mutex lock usages in the PMEM-based RB-Tree, as the thread count increases from 1 to 16.
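
The variants differ in where the mutex lives and whether its state survives restarts. Below is a minimal sketch of two of the options, using PMDK's pool-resident PMEMmutex versus a volatile DRAM pthread mutex; this is illustrative only, and the benchmarked RB-Tree code differs:

    #include <libpmemobj.h>
    #include <pthread.h>
    #include <cstdint>

    /* Option 1: lock embedded in the PMEM object. PMDK's PMEMmutex lives
       in the pool and is reinitialized safely after a restart. */
    struct pm_node {
        PMEMmutex lock;
        uint64_t  value;
    };

    void update_pm(PMEMobjpool *pop, struct pm_node *n, uint64_t v) {
        pmemobj_mutex_lock(pop, &n->lock);
        n->value = v;
        pmemobj_persist(pop, &n->value, sizeof(n->value));
        pmemobj_mutex_unlock(pop, &n->lock);
    }

    /* Option 2: a volatile DRAM lock guarding the PMEM data. Cheaper to
       acquire, but the lock state itself is never persisted. */
    pthread_mutex_t dram_lock = PTHREAD_MUTEX_INITIALIZER;

    void update_dram_lock(PMEMobjpool *pop, struct pm_node *n, uint64_t v) {
        pthread_mutex_lock(&dram_lock);
        n->value = v;
        pmemobj_persist(pop, &n->value, sizeof(n->value));
        pthread_mutex_unlock(&dram_lock);
    }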

NUMA overhead

[Figures: numa_100%_write, numa_100%_read]

Fig. 9. The benchmark results of seven indexing data structures in different persistent modes accessing local memory and remote memory. All evaluations use a single thread.
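
Remote access can be emulated by binding the threads to the other socket while the data stays on node 0, e.g. (an illustrative command):

$ numactl --cpunodebind=1 --membind=0 ./ycsbc -db storeds -threads 1 -dbpath /pmem -type array-dram -P workloads/workloada.spec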

Usage

One of the key motivations of this project is to allow others to reuse our framework and implement their own data structures for performance comparison using YCSB workloads. Please have a look at our wiki page to learn how to add new data structures to this project. The Workloads section above explains workload generation.
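
Since the framework is YCSB-C based, a new data structure plugs in behind a small key-value interface. The outline below is hypothetical; the names and signatures are ours for illustration, and the wiki documents the actual interface to implement:

    // hypothetical skeleton of a pluggable storeds data structure
    class MyIndex {
    public:
        explicit MyIndex(const char *dbpath);       // create/open the pool
        int read(const char *key, void *result);    // 100% of Workload A
        int update(const char *key, void *value);   // 90% of Workload B
        int insert(const char *key, void *value);   // 10% of Workload B
        ~MyIndex();                                 // close the pool
    };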

The project is well suited for testing third-party libraries. We helped TwoMisses (an updated version of ThreeMisses, the winner of the first SNIA NVDIMM Challenge!) to benchmark their library. The integration is done in the branch threemiss.

Contribution

If you would like to contribute to pmemids_bench, we would certainly appreciate your help! Here are some of the ways you can contribute:

  • Bug fixes, whether for performance or for correctness of the existing data structure implementations
  • Improvement of the existing benchmark implementation and documentation
  • Support for additional data structure implementations; we are open to adding more data structures in this repo

Our future goal is to provide a set of portable, high-performance PMEM-aware data structure baselines. For code contributions, please focus on code simplicity and readability. If you open a pull request, we will do an immediate sanity check and respond to you as soon as possible.

Credit

  • We use YCSB-C, Jinglei Ren's (GitHub handle: basicthinker) C++ implementation of the YCSB benchmark, as the base of our project.