Make LSD radix sort faster with software write-combining
Opened this issue · 0 comments
Naive code: https://github.com/Bulat-Ziganshin/MT-LZ/blob/master/RadixSort.cpp#L38
Speed - 19 MKeys/s: https://github.com/Bulat-Ziganshin/MT-LZ/blob/master/results.txt#L377
Optimized code: https://github.com/Bulat-Ziganshin/MT-LZ/blob/master/RadixSort.cpp#L51
Speed - 94 MKeys/s: https://github.com/Bulat-Ziganshin/MT-LZ/blob/master/results.txt#L329
Speed measured on 100M 32-bit uniform keys, sorted by 4 passes of 256-bucket sort, using single core of i7-4770: https://github.com/Bulat-Ziganshin/MT-LZ/blob/master/TestRadixSort.cpp
The optimized code combines data into 64-byte packets stored in temporary buffer, and then writes whole packet into output bucket. This improves TLB usage efficiency and avoids hardware write-combining, thus improving resulting speed (for my code and my computer) 5 times!