atomic_queues

Fast bounded MPMC and SPSC queues for C++20

This repository contains a single header file, atomic_queues.hpp, providing bounded MpmcQueue and SpscQueue implementations for C++20. The SPSC queue appears to be the best-performing SPSC queue available, while the MPMC queue appears to be the best-performing implementation when N threads <= N cores (and is by far the best performing at low contention).

If you require high performance at N threads > N cores, you may want to use a queue such as moodycamel's ConcurrentQueue. I am working on an original "composite" MPMC queue implementation which will eventually be added to this repo and will hopefully be the top-performing queue at high contention.

Note that performance varies greatly by system and the benchmarks in this README are not realistic; benchmark the queues yourself if trying to optimize performance.

The implementation of the MPMC queue is based on Dmitry Vyukov's bounded MPMC queue, and the implementation of the SPSC queue is based on Erik Rigtorp's SPSCQueue. Modifications have been made to both in order to improve performance and configurability.

Usage and Configuration

Both queues' templates take the following parameters:

<typename T, size_t N = 0, ValidSizeConstraint SizeConstraint = jdz::EnforcePowerOfTwo>
  • typename T: The type stored in the queue.
  • size_t N = 0: The capacity of the queue. Providing this as a template parameter allows the compiler to optimize the modulo operation, which may greatly improve performance at low contention (see benchmarks), and allows for compile-time checking of the parameter's validity*.
  • ValidSizeConstraint SizeConstraint = jdz::EnforcePowerOfTwo: Constraining the capacity to power of two sizes allows for the use of a bitwise AND operation instead of a modulo, which significantly improves performance at low contention, including with runtime-provided buffer sizes. For template-provided buffer sizes, this will likely be applied automatically by the compiler when possible (but it can still be good to enforce it). Options are jdz::EnforcePowerOfTwo and jdz::DoNotEnforcePowerOfTwo.

The constructor's buffer_size argument must be provided if N is equal to 0, and follows the same restrictions as N*. Passing in a custom allocator allows for use of the queue with huge pages or in shared memory.

*Buffer capacity must be greater than 1, and must respect the power of two size constraint. If N is provided, then the buffer_size_ constructor argument must be either 0 or equal to N.

Both queues implement the following interface:

Queue(size_t buffer_size = N,
      std::allocator<jdz::details::Cell<T>> allocator = std::allocator<jdz::details::Cell<T>>());

template <typename... Args>
void emplace(Args &&... args) noexcept; // blocking 

void push(const T &data) noexcept
requires std::is_nothrow_copy_constructible_v<T>; // blocking

template <typename P>
void push(P &&data) noexcept
requires std::is_nothrow_constructible_v<T, P>; // blocking

template <typename... Args>
[[nodiscard]] bool try_emplace(Args &&... args) noexcept
requires std::is_nothrow_constructible_v<T, Args &&...>; // non-blocking

bool try_push(const T &data) noexcept
requires std::is_nothrow_copy_constructible_v<T>; // non-blocking

template <typename P>
[[nodiscard]] bool try_push(P &&data) noexcept
requires std::is_nothrow_constructible_v<T, P>; // non-blocking

void pop(T &v) noexcept; // blocking
    
[[nodiscard]] bool try_pop(T &v) noexcept; // non-blocking

[[nodiscard]] size_t size() const noexcept;

[[nodiscard]] size_t capacity() const noexcept;

[[nodiscard]] bool empty() const noexcept;

The blocking methods perform better than the non-blocking methods when not preempted. If the number of threads is larger than the number of cores and preemption becomes an issue, you may achieve better performance with a retry loop over the try_* methods, although I would recommend using a different library in this situation.

Benchmarks

Benchmarks are included in the src/ folder of this repository, testing various versions of the queue implementations. These are currently not especially rigorous and will be improved.

Posted below are the results of benchmarking the implementations found in this repository against other notable implementations. Benchmarks use the blocking read/write methods where possible, i.e. push and pop for the jdz queues. Note that these may perform differently from the non-blocking methods: for Vyukov-style queues* (such as the jdz implementations and rigtorp's), the blocking methods perform better as long as N threads <= N cores, but worse than the non-blocking methods beyond that.

These benchmarks involve submitting small data types (uint64_t) between equal numbers of producers and consumers running at maximum throughput; this is not a realistic workload. Contention in real systems is likely lower for the same number of threads, and it is important to test your use case yourself.

I also plan to add benchmarks of the queues used as SPMC/MPSC, which may produce interesting results.

*Note that the original Vyukov implementation did not contain blocking methods.

SPSC Benchmarks:

These benchmarks show the time for one producer to transmit 1 billion uint64_t to one consumer. Benchmarks were run 5 times and averaged, with a capacity of 65536 (65537 for the non-power-of-two jdz trials). The power-of-two, compile-time-sized queues from this repository are the best performing, followed by Maxim Egorushkin's atomic_queue (which also uses compile-time-known power-of-two sizes). This repository's non-power-of-two, runtime-sized queues achieve better performance than the other non-power-of-two, runtime-sized queues.

Benchmarked queues are:

x86_64 - Intel i7-11800H

[SPSC benchmark chart: x86_64]

aarch64 - Apple M1 Silicon

Interestingly, jdz-cmp's modulo operation is not optimized well on this benchmark despite the compile-time known size.

[SPSC benchmark chart: aarch64]

MPMC Benchmarks

These benchmarks show the throughput measured for equal numbers of producers and consumers transmitting 100 million uint64_t. Benchmarks were run 5 times and averaged, with a capacity of 8192 (8193 for the non-power-of-two jdz trials).

We can see clearly that moodycamel's queue is by far the best when N threads > N cores, but it performs less well below this.

Benchmarked queues are:

x86_64 - Intel i7-11800H

[MPMC benchmark charts: 1p1c, 2p2c, 4p4c, 8p8c, 16p16c]

aarch64 - Apple M1 Silicon - 8 Logical Cores

TODO