parallel-sort-bench: A C++ repository from jerrycxj

Benchmarks of various sorting algorithms, including parallel sorts.

Benchmark results on a GCE n2-standard-8 instance (Ice Lake):

Run on (8 X 2600.02 MHz CPU s)
CPU Caches:
  L1 Data 48 KiB (x4)
  L1 Instruction 32 KiB (x4)
  L2 Unified 1280 KiB (x4)
  L3 Unified 55296 KiB (x1)
Load Average: 0.15, 0.38, 0.19
---------------------------------------------------------------------------------------------------------------------
Benchmark                                                     Time             CPU   Iterations      Bytes      Items
---------------------------------------------------------------------------------------------------------------------
BM_Sort/TbbSort/random/process_time                        3.17 s          16.7 s             1 61.2207M/s 8.02432M/s
BM_Sort/TbbSort/sorted_runs/process_time                   1.58 s          8.02 s             1 127.615M/s 16.7267M/s
BM_Sort/HwySort/random/process_time                        2.27 s          2.27 s             1 451.365M/s 59.1613M/s
BM_Sort/HwySort/sorted_runs/process_time                   2.38 s          2.38 s             1 429.987M/s 56.3593M/s
BM_Sort/StdPartitionHwySort/random/process_time            2.06 s          4.01 s             1 255.071M/s 33.4327M/s
BM_Sort/StdPartitionHwySort/sorted_runs/process_time       1.11 s          3.21 s             1 318.723M/s 41.7757M/s
BM_Sort/PdqSort/random/process_time                        5.67 s          5.67 s             1 180.528M/s 23.6622M/s
BM_Sort/PdqSort/sorted_runs/process_time                   5.30 s          5.30 s             1 193.329M/s   25.34M/s
BM_Sort/IntelX86SIMDSort/random/process_time               3.73 s          3.73 s             1 274.734M/s 36.0099M/s
BM_Sort/IntelX86SIMDSort/sorted_runs/process_time          2.90 s          2.90 s             1 352.558M/s 46.2105M/s

tbb::sort is only able to utilize 5.2 of 8 cores, on average, because the early partition steps are so slow.
hwy::Sorter is able to outperform tbb::sort on random data despite using only a single thread, but on sorted runs it does not go any faster.
std::partition + hwy::Sorter to make a parallel algorithm only helps a little in wall time for random data, but makes a surprisingly large difference for sorted runs. note the pivot selection used here is just the last element, which is obviously not robust.
pdqsort does not benefit from sorted runs, oddly.
avx512_sort runs a bit faster with sorted runs, but it still slower than hwy. 64% and 22% slower for random and sorted runs, respectively.

jerrycxj/parallel-sort-bench