vthreads-numa-lock.

Effective lock implementation for virtual threads on NUMA architecture in Java, which is called NUMA_MCS Also my bachelor's diploma at ITMO university. Also my research project.

Goal of the project

Researching, implementing and benchmarking NUMA aware locks for light (virtual) threads in Java. The target platform is Kunpeng-920 which has ARM-architecture.

This work could not be done without help of @Aksenov239, @anton-malakhov, @AndreyChurbanov.

JDK patch (legacy, unsuccessful try to make results better)

ricnorr/jdk19#3

How to run benchmarks

Install Gradle and Java

Run ./scripts/install_java.sh

Prepare conda,python,native libraries

Run ./scripts/activate-conda.sh from project's root to install miniconda
Run ./scripts/create-conda-envs.sh from project's root to install python libraries
Run ./scripts/build_libs.sh from project's root to compile native library

How to start benchmarks

Run ./scripts/run_bench.sh from the project's root
Results are saved in results/benchmark_results.csv
Pictures saved in results/pictures
Pdf saved in results/result.pdf

Main ideas of the research

Benchmark description

Benchmarks with label “48 cores” in title are measured on a system with HiSilicon Kunpeng-920, 48 cores, 2 NUMA nodes. Benchmarks with label “128 cores” in title are measured on a system with HiSillocon Kunpeng-920, 128 cores, 2 NUMA nodes.

Work is shared between threads, each thread in cycle does a work: multiply matrices (square matrices of size “out of c.s”), yields, acquires lock, multiple matrices (square matrix of size “in c.s”, which means in critical section), yields, releases lock. The throughput is a ratio of iterations number sum and total time.

The main idea of the lock

NUMA_MCS consists of many MCS queues and boolean flag. Vthread tries to execute a CAS operation on a flag. If the operation is successful, the vthread becomes a lock owner, otherwise “slow path” happens. Slow path of lock acquiring consists of receiving NUMA id and joining the appropriate MCS queues. When the vthread is a leader of the local MCS queue, it waits for successful CAS on the flag. Scheme of lock acquiring is presented on the picture 1.

Picture 1. NUMA_MCS lock acquisition. VT1 from numa2 executes unsuccessful CAS and joins the local queue.

When the vthread releases the lock, it sets the flag to false and unpark the next vthread in the local queue.
The main problem of implementing effective locks for virtual thread on NUMA is the effect, when vthread is yielded inside critical section and resumed on another NUMA. Virtual threads migrate between NUMA nodes more often than platform threads. So NUMA_MCS expects that sometimes we are lucky and vthread enters and leaves critical section on the same NUMA. In the picture 2 presented benchmarks, comparing lock with local queue for each NUMA (NUMA_MCS) and NUMA_MCS with just one queue (NUMA_MCS_ONE_QUEUE).

Picture 2. Multiplication of matrices 50x50 inside the critical section, and doing nothing outside of the critical section. Very high contention.

How to cache NUMA id for vthread

To get NUMA id you can perform syscall getcpu and cache the information. I suggest to cache the value on the carrier thread side and sometimes update information.

Use @Contended to solve false-sharing

This annotation is used to split data to different cache-lines and is a preferred way to solve the false-sharing problem. The cache line size can be set when initialising a VM.

Comparing with standard locks

Below presented benchmarks comparing NUMA_MCS and ReentrantLock from the Java standard library. Results show that NUMA_MCS is more effective than ReentrantLock.

ricnorr/vthreads-numa-lock