Uses rayon for parallelism and falls back on sort_unstable for short slices.
For now, it only works on slices whose length is a power of two.
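As a rough illustration of the approach described above (bitonic sort, power-of-two lengths only, standard unstable sort for short slices), here is a minimal sequential sketch. It is not this crate's actual code: the names bitonic_sort, sort_dir and merge and the FALLBACK_LEN threshold are hypothetical, and rayon is left out so the example needs only the standard library. The two recursive calls marked in the comments work on disjoint halves, which is the kind of spot where rayon::join could be used to run them in parallel.

```rust
/// Hypothetical cutoff: slices this short go straight to the standard library.
const FALLBACK_LEN: usize = 32;

/// Sorts `v` in ascending order.
/// Panics unless `v.len()` is a power of two (so an empty slice panics too).
pub fn bitonic_sort<T: Ord>(v: &mut [T]) {
    assert!(v.len().is_power_of_two(), "length must be a power of two");
    sort_dir(v, true);
}

/// Sorts `v` ascending or descending, depending on `ascending`.
fn sort_dir<T: Ord>(v: &mut [T], ascending: bool) {
    if v.len() <= FALLBACK_LEN {
        // Short slice: fall back on the standard unstable sort.
        v.sort_unstable();
        if !ascending {
            v.reverse();
        }
        return;
    }
    let mid = v.len() / 2;
    {
        let (lo, hi) = v.split_at_mut(mid);
        // The halves are disjoint, so these two calls are independent.
        // A parallel version could hand them to rayon::join instead.
        sort_dir(lo, true);
        sort_dir(hi, false);
    }
    // `v` is now bitonic (ascending half followed by descending half).
    merge(v, ascending);
}

/// Bitonic merge: turns a bitonic slice into a fully sorted one.
fn merge<T: Ord>(v: &mut [T], ascending: bool) {
    if v.len() < 2 {
        return;
    }
    let mid = v.len() / 2;
    let (lo, hi) = v.split_at_mut(mid);
    // Compare-exchange element i of the lower half with element i of the upper half.
    for (a, b) in lo.iter_mut().zip(hi.iter_mut()) {
        if (*a > *b) == ascending {
            std::mem::swap(a, b);
        }
    }
    // Each half is bitonic again, so merge them recursively.
    merge(lo, ascending);
    merge(hi, ascending);
}
```

Calling bitonic_sort(&mut data) on a Vec or array of, say, 1024 elements sorts it in place; a length such as 1000 triggers the assertion.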
Benchmarks on a 4-core (8-thread) Kaby Lake 3.8 GHz laptop:
test bitonic_128          ... bench:       1,110 ns/iter (+/- 79)
test bitonic_32768        ... bench:     554,374 ns/iter (+/- 28,342)
test bitonic_65536        ... bench:   1,211,296 ns/iter (+/- 132,500)
test rayon_stable_128     ... bench:       1,743 ns/iter (+/- 85)
test rayon_stable_32768   ... bench:     445,732 ns/iter (+/- 22,396)
test rayon_stable_65536   ... bench:     884,402 ns/iter (+/- 30,717)
test rayon_unstable_128   ... bench:       1,066 ns/iter (+/- 148)
test rayon_unstable_32768 ... bench:     402,498 ns/iter (+/- 17,377)
test rayon_unstable_65536 ... bench:     748,362 ns/iter (+/- 41,358)
test std_stable_128       ... bench:       1,718 ns/iter (+/- 98)
test std_stable_32768     ... bench:   1,231,475 ns/iter (+/- 45,878)
test std_stable_65536     ... bench:   2,618,005 ns/iter (+/- 114,323)
test std_unstable_128     ... bench:       1,185 ns/iter (+/- 112)
test std_unstable_32768   ... bench:     895,112 ns/iter (+/- 80,483)
test std_unstable_65536   ... bench:   1,774,216 ns/iter (+/- 48,544)