1 Billion Row Challenge in Rust

Environment for the benchmark

48 Intel vCPUs / 96 GB Memory / 600 GB Disk, dedicated CPU-optimized DigitalOcean instance with Premium Intel CPU, c-48-intel
Ubuntu 24.04 LTS (GNU/Linux 6.8.0-31-generic x86_64)
Rust compiler rustc 1.78.0 (9b00956e5 2024-04-29), x86_64-unknown-linux-gnu, LLVM version: 18.1.2
Java 21.0.3 2024-04-16 LTS, Java HotSpot(TM) 64-Bit Server VM Oracle GraalVM 21.0.3+7.1 (build 21.0.3+7-LTS-jvmci-23.1-b37, mixed mode, sharing)
The input file with 1 billion measurements is stored in tmpfs

Reference Rust solution tumdum/1brc

The results were obtained by running scripts/run_benchmark_original.sh. The script slightly modifies the original code so that the number of cores can be passed in, and it builds the project with native CPU support to get maximum performance.

Number of threads	Mean [s]	Min [s]	Max [s]
1	42.898 ± 0.068	42.830	42.992
2	21.718 ± 0.019	21.703	21.745
4	10.955 ± 0.046	10.904	11.017
8	5.686 ± 0.036	5.652	5.724
16	3.024 ± 0.019	3.006	3.050
24	2.209 ± 0.029	2.168	2.231
32	2.185 ± 0.013	2.176	2.205
48	1.722 ± 0.007	1.711	1.726
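
The benchmark script passes the number of cores to the program. As a rough illustration of what a configurable thread count can look like, here is a minimal sketch, not the code from tumdum/1brc or from this repository. The function names `split_at_newlines` and `count_lines_parallel`, the command-line convention (first argument is the thread count), and the tiny in-memory input are all assumptions made for the example.

```rust
use std::env;
use std::thread;

/// Split `data` into at most `n` chunks, each ending just after a newline
/// (or at the end of the input), so that no line is cut in half.
/// Hypothetical helper, for illustration only.
fn split_at_newlines(data: &[u8], n: usize) -> Vec<&[u8]> {
    let n = n.max(1);
    // Ceiling division keeps the number of chunks at or below `n`.
    let target = ((data.len() + n - 1) / n).max(1);
    let mut chunks = Vec::with_capacity(n);
    let mut start = 0;
    while start < data.len() {
        // Tentative end of this chunk, then extend it to the next newline.
        let mut end = (start + target).min(data.len());
        while end < data.len() && data[end - 1] != b'\n' {
            end += 1;
        }
        chunks.push(&data[start..end]);
        start = end;
    }
    chunks
}

/// Count lines by giving each chunk to its own scoped thread.
/// Counting newlines stands in for the real per-line work.
fn count_lines_parallel(data: &[u8], threads: usize) -> usize {
    let chunks = split_at_newlines(data, threads);
    thread::scope(|s| {
        let handles: Vec<_> = chunks
            .iter()
            .map(|chunk| s.spawn(move || chunk.iter().filter(|&&b| b == b'\n').count()))
            .collect();
        handles
            .into_iter()
            .map(|h| h.join().unwrap())
            .sum::<usize>()
    })
}

fn main() {
    // Assumed CLI: first argument is the thread count; default to all available cores.
    let threads = env::args()
        .nth(1)
        .and_then(|s| s.parse::<usize>().ok())
        .unwrap_or_else(|| thread::available_parallelism().map(|n| n.get()).unwrap_or(1));
    let data = b"Hamburg;12.0\nBulawayo;8.9\nPalembang;38.8\nHamburg;34.2\n";
    println!(
        "{} lines using up to {} threads",
        count_lines_parallel(data, threads),
        threads
    );
}
```

Splitting on newline boundaries lets every thread parse whole lines independently, so the only shared step is merging the per-thread results at the end.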

My results

The results were obtained by running scripts/run_benchmark.sh.

Implementation	Number of threads	Mean [s]	Min [s]	Max [s]
parse_large_chunks_as_i64_v2	1	19.797 ± 0.031	19.764	19.830
parse_large_chunks_as_i64_as_java	1	16.848 ± 0.050	16.784	16.900
parse_large_chunks_as_i64_v2	2	10.010 ± 0.016	9.990	10.030
parse_large_chunks_as_i64_as_java	2	8.570 ± 0.010	8.558	8.579
parse_large_chunks_as_i64_v2	4	5.077 ± 0.010	5.065	5.090
parse_large_chunks_as_i64_as_java	4	4.373 ± 0.005	4.368	4.380
parse_large_chunks_as_i64_v2	8	2.650 ± 0.010	2.639	2.661
parse_large_chunks_as_i64_as_java	8	2.299 ± 0.009	2.286	2.306
parse_large_chunks_as_i64_v2	16	1.496 ± 0.014	1.478	1.507
parse_large_chunks_as_i64_as_java	16	1.321 ± 0.006	1.312	1.326
parse_large_chunks_as_i64_v2	24	1.148 ± 0.009	1.137	1.156
parse_large_chunks_as_i64_as_java	24	1.057 ± 0.058	1.002	1.138
parse_large_chunks_as_i64_v2	32	1.232 ± 0.019	1.214	1.259
parse_large_chunks_as_i64_as_java	32	1.127 ± 0.009	1.116	1.138
parse_large_chunks_as_i64_v2	48	1.223 ± 0.012	1.210	1.238
parse_large_chunks_as_i64_as_java	48	1.174 ± 0.008	1.163	1.180
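
The table reports absolute times only. To make the scaling easier to read, the sketch below derives speedup and parallel efficiency from the mean times of parse_large_chunks_as_i64_as_java listed above. The helper names `speedup` and `efficiency` are made up for this example and are not part of the benchmark scripts.

```rust
/// Speedup of an n-thread run relative to the single-thread run.
fn speedup(t1: f64, tn: f64) -> f64 {
    t1 / tn
}

/// Parallel efficiency: speedup divided by thread count (1.0 means perfect scaling).
fn efficiency(t1: f64, tn: f64, threads: u32) -> f64 {
    speedup(t1, tn) / f64::from(threads)
}

fn main() {
    // (threads, mean seconds) for parse_large_chunks_as_i64_as_java, copied from the table.
    let runs = [
        (1, 16.848),
        (2, 8.570),
        (4, 4.373),
        (8, 2.299),
        (16, 1.321),
        (24, 1.057),
        (32, 1.127),
        (48, 1.174),
    ];
    let t1 = runs[0].1;
    for (threads, mean) in runs {
        println!(
            "{:>2} threads: speedup {:5.2}x, efficiency {:3.0}%",
            threads,
            speedup(t1, mean),
            100.0 * efficiency(t1, mean, threads)
        );
    }
}
```

In these numbers the lowest mean time is at 24 threads; the 32-thread and 48-thread runs are slightly slower, so efficiency keeps dropping past that point.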

Single-threaded dummy implementations to understand how fast it can be

Implementation	Mean [s]	Min [s]	Max [s]
naive_line_by_line_dummy	44.508 ± 0.370	44.191	45.042
parse_large_chunks_as_bytes_dummy	25.021 ± 0.059	24.941	25.081
parse_large_chunks_as_i64_dummy	16.163 ± 0.156	16.021	16.328
parse_large_chunks_simd_dummy	10.931 ± 0.044	10.896	10.995

All implementations in a single thread

Command	Mean [s]	Min [s]	Max [s]	Relative
naive_line_by_line	91.523 ± 0.482	90.955	92.110	5.43 ± 0.03
naive_line_by_line_v2	58.189 ± 0.306	57.847	58.493	3.45 ± 0.02
parse_large_chunks_as_bytes	33.714 ± 0.096	33.623	33.848	2.00 ± 0.01
parse_large_chunks_as_i64	26.451 ± 0.478	26.122	27.162	1.57 ± 0.03
parse_large_chunks_as_i64_v2	19.797 ± 0.031	19.764	19.830	1.18 ± 0.00
parse_large_chunks_as_i64_unsafe	27.408 ± 0.390	27.058	27.780	1.63 ± 0.02
parse_large_chunks_as_i64_as_java	16.848 ± 0.050	16.784	16.900	1.00
parse_large_chunks_simd	31.520 ± 0.082	31.437	31.632	1.87 ± 0.01
parse_large_chunks_simd_v1	30.221 ± 0.339	29.961	30.689	1.79 ± 0.02
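
The Relative column can be read as each mean time divided by the fastest mean time in the table (here parse_large_chunks_as_i64_as_java at 16.848 s). The sketch below reproduces a few of those ratios from the means listed above. It ignores the ± uncertainty that the table also reports, and the function name `relative` is an assumption for this example.

```rust
/// Ratio of a mean run time to the fastest mean run time.
fn relative(mean: f64, fastest: f64) -> f64 {
    mean / fastest
}

fn main() {
    // (implementation, mean seconds), a subset of the rows in the table.
    let means = [
        ("naive_line_by_line", 91.523),
        ("parse_large_chunks_as_bytes", 33.714),
        ("parse_large_chunks_as_i64_v2", 19.797),
        ("parse_large_chunks_as_i64_as_java", 16.848),
    ];
    // The fastest implementation is the one with the smallest mean.
    let fastest = means.iter().map(|&(_, m)| m).fold(f64::INFINITY, f64::min);
    for (name, mean) in means {
        println!("{}: {:.2}", name, relative(mean, fastest));
    }
}
```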

Fastest Java solution CalculateAverage_thomaswue.java

Type of run	Number of threads	Mean [s]	Min [s]	Max [s]
JVM	1	9.420 ± 0.060	9.327	9.504
Native Image	1	8.647 ± 0.038	8.612	8.704
JVM	48	2.133 ± 0.236	1.857	2.586
Native Image	48	0.4121 ± 0.0051	0.4019	0.4163

JVM

hyperfine --warmup 4 --runs 10 --export-markdown java_thomaswue.md "java --enable-preview --class-path /root/code/github/gunnarmorling/1brc/target/average-1.0.0-SNAPSHOT.jar dev.morling.onebrc.CalculateAverage_thomaswue"

Native Image

hyperfine --warmup 4 --runs 10 --export-markdown java_native_thomaswue.md /root/code/github/gunnarmorling/1brc/target/CalculateAverage_thomaswue_image

REASY/1brc