skarupke/ska_sort

Make LSD radix sort faster with software write-combining

Opened this issue · 0 comments

Naive code: https://github.com/Bulat-Ziganshin/MT-LZ/blob/master/RadixSort.cpp#L38
Speed - 19 MKeys/s: https://github.com/Bulat-Ziganshin/MT-LZ/blob/master/results.txt#L377

Optimized code: https://github.com/Bulat-Ziganshin/MT-LZ/blob/master/RadixSort.cpp#L51
Speed - 94 MKeys/s: https://github.com/Bulat-Ziganshin/MT-LZ/blob/master/results.txt#L329

Speed measured on 100M 32-bit uniform keys, sorted by 4 passes of 256-bucket sort, using single core of i7-4770: https://github.com/Bulat-Ziganshin/MT-LZ/blob/master/TestRadixSort.cpp

The optimized code combines data into 64-byte packets stored in temporary buffer, and then writes whole packet into output bucket. This improves TLB usage efficiency and avoids hardware write-combining, thus improving resulting speed (for my code and my computer) 5 times!