/memento

Memento: A Framework for Detectable Recoverability in Persistent Memory (PLDI 2023)

Primary LanguageRustCreative Commons Attribution 4.0 InternationalCC-BY-4.0

Memento: A Framework for Detectable Recoverability in Persistent Memory

This is the artifact for the following paper:

Kyeongmin Cho, Seungmin Jeon, Azalea Raad, Jeehoon Kang. Memento: A Framework for Detectable Recoverability in Persistent Memory. PLDI 2023.

Contributions (paper §1)

  • In §2, we describe how to design programs that are deterministically replayed after a crash. We do so using two primitive operations, detectably recoverable checkpoint and CAS, by composing them with usual control constructs such as sequential composition, conditionals, and loops.
  • In §3, we design a core language for persistent programming and its associated type system for deterministic replay, and prove that well-typed programs are detectably recoverable.
  • In §4, we present an implementation of our core language in the Intel-x86 Optane DCPMM architecture. Our construction is not tightly coupled with Intel-x86, and we believe that our implementation can be straightforwardly adapted to other PM architectures.
  • In §5, we adapt several volatile, lock-free data structures (DSs) to satisfy our type system, automatically deriving detectable, persistent lock-free DSs. These include a detectable, persistent linked-list Harris 2001, Treiber stack Treiber 1986, Michael-Scott queue Michael and Scott 1996, a combining queue, and Clevel hash table Chen et al. 2020. In doing so, we capture the optimizations of hand-tuned persistent lock-free DSs with additional primitives and type derivation rules (§B and §C), and support safe memory reclamation even in the presence of crashes (§D).
  • In §6, we evaluate the detectability and performance of our CAS and automatically derived persistent DSs. They recover from random thread crashes in stress tests (§6.1); and perform comparably with the existing persistent DSs with and without detectability (§6.2).

Artifacts

  • Implementation of the Memento framework and its primitives (§4 : src/pmem/ and src/ploc/)
  • Implementation of several detectably persistent DSs based on Memento (§5 : src/ds/)
  • Evaluation programs (correctness and performance) (§7 : evaluation/)
  • Full result data of benchmark (§7 : evaluation_data/ in Zenodo)

Getting Started Guide

You can either reuse a pre-built docker image memento-image.tar from our Zenodo archive or manually build the framework.

Requirements

  • Ubuntu 20.04 or later
  • Intel® Optane™ Persistent Memory 100 Series (mounted at /mnt/pmem0).
    • In case that a persistent memory is not mounted, you can still perform a limited evaluation on DRAM.

Option 1: Running on Docker (Loading Docker Image)

You can reuse a pre-built docker image by loading memento-image.tar:

docker load -i memento-image.tar
docker run -it -v /mnt/pmem0:/mnt/pmem0 --cap-add=SYS_NICE memento  # Assuming persistent memory is mounted at /mnt/pmem0

Here, -v /mnt/pmem0:/mnt/pmem0 option is conditionally required to share the mounted persistent memory area with the container for the full evaluation. Also, --cap-add=SYS_NICE option is needed to evalute performance by unifying all used cores into a single numa node.

You can re-build a docker image by docker build -t memento .. (It may take more than 30 minutes.)

Option 2: Running on host

Dependencies

  • Rust

    curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
  • Additional dependencies for evaluation:

    apt install build-essential clang python3-pip numactl \
      libpmemobj-dev libvmem-dev libgflags-dev \
      libpmemobj1 libpmemobj-cpp-dev \
      libatomic1 libnuma1 libvmmalloc1 libvmem1 libpmem1
    pip3 install --user pandas matplotlib gitpython

Build

To build our framework including detectable operations, DSs and SMR libraries:

git submodule update --init --recursive
(cd ext/pmdk-rs; git apply ../pmdk-rs.patch)
cargo build --release

If persistent memory is not mounted on your machine, add a feature flag with no_persist as follows:

cargo build --release --features no_persist

Step-by-Step Instructions

This artifact aims to achieve the following goals:

  • G1: Locating our framework's core concepts (§4,5,B,D) in the development
  • G2: Reproducing the detectability evaluation (§6.1)
  • G3: Reproducing the performance evaluation (§6.2)

G1: Locating our framework's core concepts (§4,5,B,D) in the development

  • src/ploc/: persistent memory (PM) infrastructure and primitive operations (§4, §B)
  • src/ds/: Memento-based persistent, detectable DSs supporting exactly-once semantics (§5)
  • crossbeam-persistency/: safe memory reclamation scheme (§D)

PM Infrastructure (§4.1)

Primitive Operations (§4, §B)

Concurrent Data Structures (§5)

  • src/ds/comb.rs: A Memento-based detectable combining operation. We convert the original PBComb to one using mementos to support multi-time detectability. (Comb-mmt)
  • src/ds/list.rs: A Memento-based lock-free list that uses DetectableCas and Checkpoint based on Harris' ordered linked list. (List-mmt)
  • src/ds/treiber_stack.rs: A Memento-based lock-free stack that uses DetectableCas and Checkpoint based on Treiber's stack. (TreiberS-mmt)
  • src/ds/queue_general.rs: A Memento-based lock-free queue that uses DetectableCas and Checkpoint based on Michael-Scott Queue. (MSQ-mmt-O0)
  • src/ds/queue_lp.rs: A Memento-based lock-free queue that uses Insert, Delete and Checkpoint. The difference from queue.rs is that this queue uses general link-persist technique rather than exploits DS-specific invariant for issuing less flushes when loading shared pointer. (MSQ-mmt-O1)
  • src/ds/queue_comb.rs: A Memento-based combining queue that uses Combining operation. (CombQ-mmt)
  • src/ds/clevel.rs: A Memento-based Clevel extensible hash table. We convert original Clevel to one using mementos. (Clevel-mmt)
  • src/ds/queue.rs: A Memento-based lock-free queue that uses Insert, Delete and Checkpoint based on Michael-Scott Queue. (MSQ-mmt-O2)

Safe Memory Reclamation (§D)

G2: Reproducing the detectability evaluation (§6.1)

Thread Crash Test

We evaluate the detectability in case of thread crashes by randomly crashing an arbitrary thread while running the integration test. To crash a specific thread, we use the tgkill system call to send the SIGUSR1 signal to the thread and let its signal handler abort its execution.

Install
cd evaluation/correctness/tcrash
./build.sh # specially build for the thread crash test
Run

You can test each DS with the following command:

./run.sh [tested DS]

where tested DS should be replaced with one of supported tests (listed below). For example, the following command is to infinitely check that the test of MSQ-mmt-O0 in the paper always pass in case of an unexpected thread crash:

./run.sh queue_general

Then the output is printed out like below:

clear queue_general
⎾⎺⎺⎺⎺⎺⎺⎺⎺⎺⎺⎺⎺ thread crash-recovery test queue_general 1 (retry: 0) ⎺⎺⎺⎺⎺⎺⎺⎺⎺⎺⎺⎺⏋
run queue_general
[Test 1] success
clear queue_general
⎾⎺⎺⎺⎺⎺⎺⎺⎺⎺⎺⎺⎺ thread crash-recovery test queue_general 2 (retry: 0) ⎺⎺⎺⎺⎺⎺⎺⎺⎺⎺⎺⎺⏋
run queue_general
[Test 2] success
clear queue_general
⎾⎺⎺⎺⎺⎺⎺⎺⎺⎺⎺⎺⎺ thread crash-recovery test queue_general 3 (retry: 0) ⎺⎺⎺⎺⎺⎺⎺⎺⎺⎺⎺⎺⏋
run queue_general
[Test 3] success
clear queue_general
⎾⎺⎺⎺⎺⎺⎺⎺⎺⎺⎺⎺⎺ thread crash-recovery test queue_general 4 (retry: 0) ⎺⎺⎺⎺⎺⎺⎺⎺⎺⎺⎺⎺⏋
run queue_general
^C

It also creates a short progress log and a full test log under ./out.

If a bug exists (just for an example), the output is like below:

clear queue_general
⎾⎺⎺⎺⎺⎺⎺⎺⎺⎺⎺⎺⎺ thread crash-recovery test queue_general 1 (retry: 0) ⎺⎺⎺⎺⎺⎺⎺⎺⎺⎺⎺⎺⏋
run queue_general
./run.sh: line 51: 855011 Aborted                 RUST_BACKTRACE=1 RUST_MIN_STACK=2000000000 numactl --cpunodebind=0 --membind=0 timeout $TIMEOUT $SCRIPT_DIR/../../target/x86_64-unknown-linux-gnu/release/deps/memento-* $target::test --nocapture &>> $log_tmp
fails with exit code 134
[Test 1] fails with exit code 134
clear queue_general
⎾⎺⎺⎺⎺⎺⎺⎺⎺⎺⎺⎺⎺ thread crash-recovery test queue_general 2 (retry: 0) ⎺⎺⎺⎺⎺⎺⎺⎺⎺⎺⎺⎺⏋
run queue_general
^C

It then generates a bug directory consisting of a text file containg specific error log (info.txt) and a PM pool files (queue_general.pool_*) of the buggy execution so that we can debug the DS using it.

For each primitive and DS, we observe no test failures for 1M runs with thread crashes.

Supported tests
For primitives
  • checkpoint
  • detectable_cas
For data structures
  • queue_general: MSQ-mmt-O0 (in the paper)
  • queue_lp: MSQ-mmt-O1
  • queue: MSQ-mmt-O2
  • queue_comb CombQ-mmt
  • treiber_stack: TreiberS-mmt
  • list: List-mmt
  • clevel: Clevel-mmt

Persistency Bug Finding Test (Yashme/PSan)

We evaluate the correctness of our primitives and DSs using existing bug finding tools, Yashme and PSan. They are finding persistent bugs such as persistency race, missing flushes based on model checking framework Jaaru.

Install
cd evaluation/correctness/pmcheck
./scripts/build_pmcpass.sh # may take more than 10 minutes to build LLVM
./build.sh
Run

You can test each DS with the following command:

./run.sh [tested DS] [tool] [mode]

where

  • tested DS should be replaced with one of supported tests (listed below).
  • tool: yashme or psan
  • mode: random or model (random testing mode or model checking mode, respectively)

For example, the following command is to test the MSQ-mmt-O0 using PSan with random mode:

./run.sh queue_O0 psan random

Then the output is printed out like below:

Jaaru
Copyright (c) 2021 Regents of the University of California. All rights reserved.
Written by Hamed Gorjiara, Brian Demsky, Peizhao Ou, Brian Norris, and Weiyu Luo

Execution 1 at sequence number 198
nextCrashPoint = 83987	max execution seqeuence number: 88289
nextCrashPoint = 2876	max execution seqeuence number: 4161
Execution 2 at sequence number 4161
nextCrashPoint = 1106	max execution seqeuence number: 4171
nextCrashPoint = 1583	max execution seqeuence number: 4181
Execution 3 at sequence number 4181
nextCrashPoint = 3756	max execution seqeuence number: 4166
nextCrashPoint = 31	max execution seqeuence number: 4176
Execution 4 at sequence number 4176
nextCrashPoint = 2400	max execution seqeuence number: 4181

...

******* Model-checking complete: *******
Number of complete, bug-free executions: 10
Number of buggy executions: 0
Total executions: 10

For each primitive and DS, we observe no buggy executions for 1K runs with random mode.

Supported tests
For primitives
  • checkpoint
  • detectable_cas
For data structures
  • queue_O0: MSQ-mmt-O0 (in the paper)
  • queue_O1: MSQ-mmt-O1
  • queue_O2: MSQ-mmt-O2
  • queue_comb CombQ-mmt
  • treiber_stack: TreiberS-mmt
  • list: List-mmt
  • clevel: Clevel-mmt

G3: Reproducing the performance evaluation (§6.2)

Performance Evaluation of CAS

We evaluate the performance of CASes with our benchmark. Each implementation of comparison targets exists in evaluation/performance/cas/src/.

Install
cd evaluation/performance/cas
./build.sh
Run the entire benchmark
./run.sh  # This may take about 3 hours

This creates CSV data and plots under ./out/.

Run a single benchmark

You can run a single benchamrk,

./target/release/cas_bench -f <filepath> -a <target> -c <locations> -t <threads> -o <output>

where

  • target: mcas (CAS-mmt at paper), pmwcas, nrlcas
  • locations: number of locations

For example, following command measure the throughput and memory usage of mcas when using 1000 locations and 16 threads.

./target/release/cas_bench -f /mnt/pmem0/mcas.pool -a mcas -c 1000 -t 16 -o ./out/cas-mmt.csv
  • This creates raw CSV data under ./out/cas-mmt.csv.
  • To pinning NUMA node 0, you should attach numactl --cpunodebind=0 --membind=0 at the front of the command.

For detailed usage information,

./target/release/cas_bench -h

Performance Evaluation of List

We evaluate the performance of Memento-based list compared to other detectable list. Each implementation of comparison targets exists in evaluation/performance/list/src/. To evaluate the performance of detectable list based on Tracking, Capsule, Casule-Opt, we use the implementations published by Detectable Recovery of Lock-Free Data Structures (PPoPP '22) authors.

Install
cd evaluation/performance/list
./build.sh
Run the entire benchmark
./run.sh  # This may take about 7 hours

This creates CSV data and plots under ./out/.

Run a single benchmark
List-mmt

You can run a single benchamrk for list-mmt,

./target/release/bench -f <filepath> -a list-mmt  -t <threads> -k <key-range> --insert-ratio <insert-ratio> --delete-ratio <delete-ratio> --read-ratio <read-ratio> -o <outpath>

For example, following command measure the throughput of list-mmt with read-intensive workload, when using 16 threads and 500 key ranges.

./target/release/bench -f /mnt/pmem0/list-mmt.pool -a list-mmt -t 16 -k 500 --insert-ratio 0.15 --delete-ratio 0.15 --read-ratio 0.7 -o ./out/list-mmt.csv
  • This creates raw CSV data under ./out/list-mmt.csv.
  • To pinning NUMA node 0, you should attach numactl --cpunodebind=0 --membind=0 at the front of the command.

For detailed usage,

./target/release/bench -h
Tracking, Capsules, Capsules-Opt

We refer to https://github.com/ConcurrentDistributedLab/Tracking.

Performance Evaluation of Queue

We evaluate the performance of Memento-based queues and other queues. Each implementation of comparison targets exists in evaluation/performance/queue/src/.

Install
cd evaluation/performance/queue
./build.sh
Run the entire benchmark
./run.sh  # This may take more than 14 hours

This creates CSV data and plots under ./out/.

Run a single benchmark

You can run a single benchamrk,

./target/release/bench -f <filepath> -a <target> -k <kind> -t <threads> -i <init_nodes> -o <output>

where

  • target: memento_queue (MSQ-mmt-O2 in the paper), memento_queue_lp (MSQ-mmt-O1 in the paper), memento_queue_general (MSQ-mmt-O0 in the paper), memento_queue_comb (CombQ-mmt in the paper), durable_queue, log_queue, dss_queue, pbcomb_queue, crndm_queue
  • kind: pair (enq-deq pair), prob{n} (n% probability enq or 100-n% deq)

For example, following command measure the throughput of memento_queue with pair workload, when using 16 threads.

./target/release/bench -f /mnt/pmem0/mmt.pool -a memento_queue -k pair -t 16 -i 0 -o ./out/mmt.csv
  • This creates raw CSV data under ./out/mmt.csv.
  • To pinning NUMA node 0, you should attach numactl --cpunodebind=0 --membind=0 at the front of the command.

For detailed usage information,

./target/release/bench -h
Benchmarking PMDK and Clobber-NVM queue

To run a single benchmark for PMDK and Clobber-NVM queues, you should use separate executables with the following commands.

PMDK queue:

./target/release/bench_cpp <filepath> <target> <kind> <threads> <duration> <init_nodes> <output> # <target> should be "pmdk_queue"

Clobber-NVM queue:

PMEM_IS_PMEM_FORCE=1 ./src/clobber-nvm/apps/queue/benchmark-clobber -k <kind> -t <threads> -d 8 -s <duration> -i <init_nodes> -o <output>

Performance Evaluation of Hash

We used the same benchmark as Persistent Memory Hash Indexes: An Experimental Evaluation (VLDB '21) to evaluate our hash. Each implementation of comparison targets exists in evaluation/performance/hash/hash/.

Install
ulimit -s 8192000
cd evaluation/performance/hash
./build.sh
Run the entire benchmark
./run.sh  # This may take about 30 hours

This creates raw txt that containing measuring result and plots under ./out/.

Run a single benchmark

You can run a single benchamrk with PiBench executable,

cd bin
./PiBench [lib.so] [args...]

where

For example, following command measure the search throughput of clevel_rust when using 32 threads with uniform distribution.

./bin/PiBench ./bin/clevel_rust.so \
    -S 16777216 \       # initial capacity
    -p 200000000 \      # number of operations
    -r 1 -i 0 -d 0 \    # read 100%, insert 0%, delete 0%
    -M THROUGHPUT --distribution UNIFORM \
    -t 32 \

Use PMDK allocator

You can evaluate clevel_rust on top of the PMDK allocator (instead of Ralloc) by appending pmdk to the build command.

For example:

./build.sh pmdk # This builds clevel_rust on the top of PMDK allocator