async-bench

A benchmark of approaches to writing server applications.

Frameworks

The benchmark distinguishes two types of approaches:

shared

This includes approaches that:

  • can split work across multiple processors
  • can handle non-uniform load (e.g. by having a shared run queue; see the sketch below)
  • provide some synchronization primitives between tasks

This includes: threads, Boost.Asio, Go, Tokio, async-std, and fev (in several scheduler variants).
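
As an illustration of the shared run queue idea, here is a minimal sketch of a mutex-protected queue in the spirit of the work-sharing-locking variants benchmarked below (hypothetical; fev's actual data structures may differ):

    /* Hypothetical sketch: one shared, mutex-protected run queue that all
     * worker threads push ready tasks to and pop work from, so an idle
     * worker can always pick up pending tasks. */
    #include <pthread.h>
    #include <stddef.h>

    struct task {
        struct task *next;
        void (*run)(struct task *);
    };

    struct run_queue {
        pthread_mutex_t lock;
        struct task *head;
        struct task *tail;
    };

    static void run_queue_push(struct run_queue *q, struct task *t)
    {
        t->next = NULL;
        pthread_mutex_lock(&q->lock);
        if (q->tail)
            q->tail->next = t;
        else
            q->head = t;
        q->tail = t;
        pthread_mutex_unlock(&q->lock);
    }

    static struct task *run_queue_pop(struct run_queue *q)
    {
        pthread_mutex_lock(&q->lock);
        struct task *t = q->head;
        if (t) {
            q->head = t->next;
            if (!q->head)
                q->tail = NULL;
        }
        pthread_mutex_unlock(&q->lock);
        return t;
    }

The work-stealing variants turn this around: each worker has its own queue, and idle workers steal tasks from the queues of busy ones.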

prefork

This type uses the SO_REUSEPORT flag and relies on the underlying kernel to split the work across multiple threads. After a connection is accepted, it is pinned to one thread, and synchronization is mostly avoided. However, this type does not provide any strategy for handling non-uniform load, so underutilization of some processors is possible.
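
For example, in the prefork model each worker thread might create its own listening socket along these lines (a minimal sketch; the accept loop and fuller error handling are omitted):

    /* Each of the N worker threads opens its own listening socket on the
     * same port; with SO_REUSEPORT the kernel spreads incoming connections
     * across the sockets. */
    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <stdint.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <unistd.h>

    static int listen_reuseport(uint16_t port)
    {
        struct sockaddr_in addr;
        int one = 1;
        int fd = socket(AF_INET, SOCK_STREAM, 0);

        if (fd < 0)
            return -1;
        if (setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &one, sizeof(one)) < 0)
            goto fail;

        memset(&addr, 0, sizeof(addr));
        addr.sin_family = AF_INET;
        addr.sin_addr.s_addr = htonl(INADDR_ANY);
        addr.sin_port = htons(port);
        if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0)
            goto fail;
        if (listen(fd, SOMAXCONN) < 0)
            goto fail;
        return fd;

    fail:
        close(fd);
        return -1;
    }

Each worker then runs its own accept/event loop on its socket, which is why accepted connections stay pinned to one thread.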

Benchmarks

hello is a simple server that waits for a request and sends a valid HTTP response. It doesn't parse requests: after reading any data, it sends a response.

Moreover, it tries to send the HTTP response in full (in one write/send call). If it fails to do so, the process is killed. However, this never happened during the benchmarks; each response was fully written in one call.

These simplifications were made so that we can focus on measuring how well the frameworks handle I/O and scheduling. HTTP was chosen so that existing benchmarking tools, such as wrk, can be used as well.
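
In a blocking, thread-per-connection style, the logic above boils down to roughly the following (a sketch only; each framework in the benchmark expresses the same logic in its own idiom, and the exact response bytes are illustrative):

    #include <stdlib.h>
    #include <unistd.h>

    static const char RESPONSE[] =
        "HTTP/1.1 200 OK\r\n"
        "Content-Length: 13\r\n"
        "\r\n"
        "Hello, World!";

    static void serve(int fd)
    {
        char buf[4096];
        for (;;) {
            /* Any incoming data counts as a request; nothing is parsed. */
            ssize_t n = read(fd, buf, sizeof(buf));
            if (n <= 0)
                break;
            /* The whole response must go out in a single write;
             * a partial write kills the process. */
            if (write(fd, RESPONSE, sizeof(RESPONSE) - 1)
                    != (ssize_t)(sizeof(RESPONSE) - 1))
                abort();
        }
        close(fd);
    }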

hello-timeout additionally adds 5-second timeouts for both reading and writing. This should show how well timers are handled.
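
For a blocking implementation, one way to get such timeouts is per-socket send/receive timeouts (a hypothetical sketch; the event-loop frameworks arm timers in their schedulers instead):

    #include <sys/socket.h>
    #include <sys/time.h>

    /* Make blocking read()/write() on this socket fail with
     * EAGAIN/EWOULDBLOCK after 5 seconds. */
    static int set_io_timeouts(int fd)
    {
        struct timeval tv = { .tv_sec = 5, .tv_usec = 0 };

        if (setsockopt(fd, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv)) < 0)
            return -1;
        if (setsockopt(fd, SOL_SOCKET, SO_SNDTIMEO, &tv, sizeof(tv)) < 0)
            return -1;
        return 0;
    }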

TODO: Add some benchmarks that use synchronization primitives.

Throughput

Each server implementation spawns 12 workers (in the case of the threads implementation, the server can use as many threads as it wants). The benchmarking tool also spawns 12 threads, with 64 connections per thread. Each connection makes 20k requests. Wall-clock time is measured, and the number of requests per second is calculated from it.
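
In total, each round therefore issues 12 × 64 × 20,000 = 15,360,000 requests; a result of, say, 1,280,000 reqs/s corresponds to a wall-clock time of about 12 seconds per round.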

Each test consists of 3 warm-up rounds and 30 normal rounds; the average of the normal rounds' results is then calculated. After each test, the system is rebooted.

Both the servers and the benchmarking tool share the available processors on a machine with 6 cores and 12 threads.

hello

The following command for each framework was used:

./bench-throughput.sh ./FRAMEWORK/hello 127.0.0.1 3000 12 12 64 20000 3 30
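
(Matching the arguments against the description above, they are presumably: host, port, server workers, benchmarking threads, connections per thread, requests per connection, warm-up rounds, and normal rounds.)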

shared

framework reqs/s
threads 930,142
Boost.Asio 1,011,690
Go 1,265,821
Tokio 1,120,036
async-std 974,002
fev-epoll-work-sharing-bounded-mpmc 1,289,267
fev-epoll-work-sharing-locking 1,337,659
fev-epoll-work-sharing-simple-mpmc 1,281,790
fev-epoll-work-stealing-bounded-mpmc 1,348,033
fev-epoll-work-stealing-bounded-spmc 1,347,916
fev-epoll-work-stealing-locking 1,352,952
fev-io_uring-work-sharing-bounded-mpmc 1,025,687
fev-io_uring-work-sharing-locking 1,151,718
fev-io_uring-work-sharing-simple-mpmc 1,022,070
fev-io_uring-work-stealing-bounded-mpmc 1,217,764
fev-io_uring-work-stealing-bounded-spmc 1,208,581
fev-io_uring-work-stealing-locking 1,214,600

prefork

framework reqs/s
raw-epoll 1,401,905
Boost.Asio 1,335,521
libuv 1,373,606

hello-timeout

The following command for each framework was used:

./bench-throughput.sh ./FRAMEWORK/hello-timeout 127.0.0.1 3000 12 12 64 20000 3 30

shared

framework reqs/s
threads 904,918
Boost.Asio 478,371
Go 1,126,572
Tokio 735,227
async-std 928,534
fev-epoll-work-sharing-bounded-mpmc 1,259,515
fev-epoll-work-sharing-locking 1,291,766
fev-epoll-work-sharing-simple-mpmc 1,248,902
fev-epoll-work-stealing-bounded-mpmc 1,303,468
fev-epoll-work-stealing-bounded-spmc 1,300,814
fev-epoll-work-stealing-locking 1,302,324
fev-io_uring-work-sharing-bounded-mpmc 945,220
fev-io_uring-work-sharing-locking 1,002,519
fev-io_uring-work-sharing-simple-mpmc 928,448
fev-io_uring-work-stealing-bounded-mpmc 1,069,672
fev-io_uring-work-stealing-bounded-spmc 1,079,243
fev-io_uring-work-stealing-locking 1,089,061

prefork

framework reqs/s
Boost.Asio 1,252,307

Latency

Each server implementation spawns 6 workers (in the case of threads, taskset -c 0-5 is used). The benchmarking tool also spawns 6 threads, with 64 connections per thread. Each connection makes 20k requests. After receiving a response, the benchmarking tool delays the next request by 1 ms. The time between a request and its response is measured.
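
A minimal sketch of the per-request loop described above (hypothetical; the actual benchmarking tool may differ in details such as the request bytes and error handling):

    #include <stdint.h>
    #include <time.h>
    #include <unistd.h>

    static uint64_t now_ns(void)
    {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return (uint64_t)ts.tv_sec * 1000000000u + (uint64_t)ts.tv_nsec;
    }

    /* One request on an already-connected socket; returns the
     * request-to-response time in nanoseconds. */
    static uint64_t timed_request(int fd)
    {
        static const char REQ[] = "GET / HTTP/1.1\r\n\r\n";
        struct timespec delay = { .tv_sec = 0, .tv_nsec = 1000000 }; /* 1 ms */
        char buf[4096];

        uint64_t start = now_ns();
        (void)write(fd, REQ, sizeof(REQ) - 1);
        (void)read(fd, buf, sizeof(buf));  /* wait for the response */
        uint64_t elapsed = now_ns() - start;

        nanosleep(&delay, NULL);           /* 1 ms pause before the next request */
        return elapsed;
    }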

Each test consists of 3 warm-up rounds and 30 normal rounds; the average of the normal rounds' results is then calculated (e.g. the average of the medians from the 30 rounds) and presented in nanoseconds. After each test, the system is rebooted.

qX denotes quantiles. For example, a value in the q0.9999 column means that 99.99% of all requests took that much time or less.

hello

The following command for each framework was used:

./bench-latency.sh ./FRAMEWORK/hello 127.0.0.1 3000 6 6 64 20000 1000000 3 30
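
(Compared to the throughput command, the extra argument, 1000000, presumably specifies the 1 ms delay between requests in nanoseconds.)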

shared

framework mean median q0.9 q0.99 q0.999 q0.9999
threads 19,913 18,070 32,032 43,737 53,163 126,103
Boost.Asio 21,393 19,318 32,846 49,131 86,839 338,921
Go 18,974 17,092 29,505 43,911 59,890 209,802
Tokio 17,930 16,377 26,108 38,347 65,201 562,775
async-std 20,544 18,534 29,326 43,102 175,335 807,858
fev-epoll-work-sharing-bounded-mpmc 17,602 16,212 26,750 37,016 46,410 146,483
fev-epoll-work-sharing-locking 17,435 16,089 26,321 36,395 45,412 155,440
fev-epoll-work-sharing-simple-mpmc 17,699 16,329 26,821 37,018 46,070 152,688
fev-epoll-work-stealing-bounded-mpmc 18,904 17,168 29,364 41,342 51,860 555,482
fev-epoll-work-stealing-bounded-spmc 18,911 17,153 29,367 41,361 51,835 664,495
fev-epoll-work-stealing-locking 18,725 17,018 29,018 40,837 51,203 505,659
fev-io_uring-work-sharing-bounded-mpmc 69,758 21,094 47,885 788,073 9,267,102 14,517,913
fev-io_uring-work-sharing-locking 26,449 22,329 43,839 83,165 159,117 246,950
fev-io_uring-work-sharing-simple-mpmc 48,472 22,607 51,371 445,012 4,072,171 7,680,008
fev-io_uring-work-stealing-bounded-mpmc 58,697 39,867 117,507 286,562 518,514 2,971,076
fev-io_uring-work-stealing-bounded-spmc 58,104 39,753 117,507 286,530 505,817 939,609
fev-io_uring-work-stealing-locking 54,257 37,978 108,610 263,357 465,995 717,205

prefork

framework mean median q0.9 q0.99 q0.999 q0.9999
raw-epoll 17,513 15,558 27,646 38,822 47,897 554,005
Boost.Asio 18,063 16,092 28,258 40,000 49,434 442,801
libuv 18,164 16,130 28,641 40,406 49,781 508,164

hello-timeout

The following command for each framework was used:

./bench-latency.sh ./FRAMEWORK/hello-timeout 127.0.0.1 3000 6 6 64 20000 1000000 3 30

shared

framework mean median q0.9 q0.99 q0.999 q0.9999
threads 20,405 18,508 32,896 44,903 54,687 140,981
Boost.Asio 21,459 19,372 33,178 49,491 84,077 180,739
Go 20,110 18,057 31,519 47,055 64,470 165,667
Tokio 23,378 21,526 35,182 51,618 75,525 455,931
async-std 23,026 20,626 32,986 46,471 229,452 1,204,809
fev-epoll-work-sharing-bounded-mpmc 18,542 17,128 28,066 38,707 47,698 132,021
fev-epoll-work-sharing-locking 18,465 17,031 28,014 38,704 47,466 136,518
fev-epoll-work-sharing-simple-mpmc 18,674 17,218 28,396 39,215 47,925 147,948
fev-epoll-work-stealing-bounded-mpmc 20,350 18,473 31,721 44,484 55,087 372,055
fev-epoll-work-stealing-bounded-spmc 19,925 18,070 31,030 43,626 54,130 412,133
fev-epoll-work-stealing-locking 19,874 18,035 30,932 43,489 53,939 377,958
fev-io_uring-work-sharing-bounded-mpmc 29,438 20,990 38,865 71,960 1,227,937 10,102,328
fev-io_uring-work-sharing-locking 22,644 20,145 35,441 54,069 88,632 163,720
fev-io_uring-work-sharing-simple-mpmc 25,832 20,969 38,377 68,521 332,538 4,341,473
fev-io_uring-work-stealing-bounded-mpmc 42,712 32,270 78,126 186,425 332,363 535,485
fev-io_uring-work-stealing-bounded-spmc 42,962 32,455 78,650 188,493 336,021 548,401
fev-io_uring-work-stealing-locking 41,850 32,067 75,920 178,941 316,551 486,942

prefork

framework mean median q0.9 q0.99 q0.999 q0.9999
Boost.Asio 18,364 16,417 28,488 40,526 50,294 444,049

Environment

  • i7-8700K (6 cores, 12 threads)
  • Linux 5.8.5-arch1-1 with mitigations=off
  • GCC 10.2.0
  • Boost 1.72
  • Rust 1.46.0 (04488afe3 2020-08-24)
  • Go 1.15