intel/x86-simd-sort

Merge sort and AVX512

Opened this issue · 3 comments

Hi!

Many thanks for your contribution to speeding up sorting in numpy. Wanted to ask if there are any plans to speed up merge/tim sort with AVX2/AVX512?

Hi @sergii-mamedov that isn't in my to do list (at least not yet) unless there is a good enough justification to work on those. Is there a reason why merge/tim sort will be preferred over quicksort?

Hi @r-devulap
I was supported the METASPACE project. Sorting is one of the most time-consuming steps. We have from a hundred to ten thousand already sorted arrays (m/z spectra), which we need to combine into one array and sort. The total size of arrays ranges from hundreds of megabytes to hundreds of gigabytes.
I tested quicksort vs mergesort (in numpy terminology) two years ago - mergesort was 5-10 times faster. Last weekend I tested the latest release of numpy - mergesort is still about twice as fast on my data.
Also, a feature of merge/tim sort is that it is a stable variant of sorting, that is, the order of elements is determined. This is important for this task.

curious as to what data type and distribution you have in your application. On NumPy benchmarks, quicksort is slower than merge sort only for ordered and reverse arrays:

             quick   float64        ('random',)         37.9±0.02μs
              quick   float64        ('ordered',)         37.3±0.4μs
              quick   float64       ('reversed',)         37.8±0.2μs
              quick   float64        ('uniform',)        5.44±0.04μs
              quick   float64    ('sorted_block', 10)    36.2±0.08μs
              quick   float64   ('sorted_block', 100)    39.7±0.05μs
              quick   float64   ('sorted_block', 1000)    43.5±0.1μs
              merge   float64        ('random',)          609±0.7μs
              merge   float64        ('ordered',)        18.3±0.03μs
              merge   float64       ('reversed',)        12.7±0.04μs
              merge   float64        ('uniform',)        18.3±0.03μs
              merge   float64    ('sorted_block', 10)      169±1μs
              merge   float64   ('sorted_block', 100)     123±0.8μs
              merge   float64   ('sorted_block', 1000)   68.2±0.06μs