intel/x86-simd-sort

Argsort on int32_t severely underperforms

Opened this issue ยท 28 comments

Hello, I just want to make sure I'm not doing something wrong. I stumbled upon this project when I found that IPP's radix sort had buffer size calculations that were overflowing well before the limits of an INT32_MAX and all of the "Index" variants also only used 32 bit types. To be clear, a 32 bit index variant still would be beneficial and I see there are plans to eventually implement that. It's just I need a conditional code path for when there are too many values (I have a dataset generated that has length in the billions). I noticed that when the code makes use of the AVX512 instructions for argsort, it severely under performs against the radix sort (which as far as I can tell from the assembly, is completely scalar except for an iota like scan to generate the sequential index).

The benchmarks distributed with this project do show it winning by some margins against the scalar variants in here but not by the typical order of magnitudes that were touted when this project was first introduced. Is this maybe a result of the gather data sampling mitigations in microcode? I know this heavily uses compressed store and gather, and those functions were severely limited by this.

Here's my benchmark on my hardware:

[astylinski@fedoravm builddir]$ ./benchexe --benchmark_filter=/int32_t
2023-12-03T11:30:05-05:00
Running ./benchexe
Run on (6 X 3312 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x6)
  L1 Instruction 32 KiB (x6)
  L2 Unified 4096 KiB (x6)
  L3 Unified 16384 KiB (x1)
Load Average: 1.02, 0.68, 0.32
--------------------------------------------------------------------------------
Benchmark                                      Time             CPU   Iterations
--------------------------------------------------------------------------------
simdargsort/random_128/int32_t               959 ns          956 ns       731536
simdargsort/random_256/int32_t              2045 ns         2040 ns       343034
simdargsort/random_512/int32_t              3866 ns         3857 ns       181511
simdargsort/random_1k/int32_t               9289 ns         9267 ns        75493
simdargsort/random_5k/int32_t              50900 ns        50768 ns        13789
simdargsort/random_100k/int32_t          1764351 ns      1757047 ns          400
simdargsort/random_1m/int32_t           29256670 ns     29111150 ns           24
simdargsort/random_10m/int32_t         959237861 ns    954095660 ns            1
simdargsort/smallrange_128/int32_t           722 ns          720 ns       975799
simdargsort/smallrange_256/int32_t          2049 ns         2044 ns       342384
simdargsort/smallrange_512/int32_t          3655 ns         3646 ns       191985
simdargsort/smallrange_1k/int32_t           9169 ns         9147 ns        76497
simdargsort/smallrange_5k/int32_t          21375 ns        21320 ns        32747
simdargsort/smallrange_100k/int32_t       467703 ns       465853 ns         1502
simdargsort/smallrange_1m/int32_t        9097255 ns      9047326 ns           75
simdargsort/smallrange_10m/int32_t     183596592 ns    182645622 ns            4
simdargsort/sorted_10k/int32_t            104574 ns       104314 ns         6668
simdargsort/constant_10k/int32_t            8648 ns         8628 ns        81134
simdargsort/reverse_10k/int32_t           105977 ns       105722 ns         6558
scalarargsort/random_128/int32_t             797 ns          795 ns       882026
scalarargsort/random_256/int32_t            1736 ns         1732 ns       403901
scalarargsort/random_512/int32_t            4000 ns         3990 ns       176210
scalarargsort/random_1k/int32_t            21879 ns        21822 ns        31937
scalarargsort/random_5k/int32_t           262249 ns       261537 ns         2676
scalarargsort/random_100k/int32_t        7551632 ns      7525002 ns           93
scalarargsort/random_1m/int32_t        103352847 ns    102895746 ns            7
scalarargsort/random_10m/int32_t      1974523438 ns   1964054017 ns            1
scalarargsort/smallrange_128/int32_t         667 ns          665 ns      1051997
scalarargsort/smallrange_256/int32_t        1498 ns         1494 ns       469065
scalarargsort/smallrange_512/int32_t        3152 ns         3144 ns       223444
scalarargsort/smallrange_1k/int32_t        13730 ns        13697 ns        50943
scalarargsort/smallrange_5k/int32_t       127606 ns       127285 ns         5499
scalarargsort/smallrange_100k/int32_t    3031465 ns      3019854 ns          232
scalarargsort/smallrange_1m/int32_t     38177852 ns     37995219 ns           18
scalarargsort/smallrange_10m/int32_t   695700305 ns    692244521 ns            1
scalarargsort/sorted_10k/int32_t           92423 ns        92202 ns         7578
scalarargsort/constant_10k/int32_t         83139 ns        82931 ns         8439
scalarargsort/reverse_10k/int32_t          74214 ns        74031 ns         9359

HI @KungFuJesus. We haven't compared performance of avx512_argsort with IPP SortRadixIndexAscend yet. I will try to do this sometime next week. avx512_argsort used to leverage gather instruction heavily, but because of the DOWNFALL security vulnerability, we replaced that with a scalar emulation of it in #65 and I believe that has impacted its performance. Also worthy to note that IPP sort uses an additional O(N) space whereas avx512_argsort is O(1) space, so I am not sure if this is an apples-to-apples comparison in the first place.

Your benchmark numbers look right, and I don't believe we claimed avx512_argsort to be the orders of magnitude faster than its scalar counterpart.

I'm sure this project never made the claim, I just recall Phoronix touting something like a 10x gain when numpy used it in place of std::sort. I'm sure there'll be nuance there. The scalar replacement does explain why me disabling GDS mitigations didn't seem to make a difference, though. The additional memory requirement is acceptable in my case. I do wonder if the sorting network approach could be applied to a radix sort for the moves. E.g. instead of comparisons for the sorting network, the bucket positions get computed. I'm like 90% sure IPP's using purely scalar operations for everything but a linear sequence generation.

I'm sure this project never made the claim, I just recall Phoronix touting something like a 10x gain when numpy used it in place of std::sort

I believe that was for quicksort, not argsort.

The additional memory requirement is acceptable in my case.

If additional space is acceptable, then you could try using key-value sort keyvalue_qsort instead of argsort. Key-value sort will sort two vectors based on the value of one of them. Pass a copy of your array and an array of indices initialized to [0, 1, 2, .. N-1]. Unlike argsort, key-value sort doesn't rely on gather (or its emulation) to load data into registers and should, in theory, be faster. It does have the downside that you have make a copy of the array which can be expensive. Would be curious to know how it performs.

I had tried that initially to try to get 32 bit values for indices and it wasn't working for me correctly for whatever reason. I didn't try it with size_t indices, though.

32-bit indices weren't supported until recently: #108. Did you try it after this was merged?

Yes, last commit I was trying against was:
d9c9737

And it was wrong, and not wrong because it was unstable, but something very wonky. I'm computing a voxelization index and scanning through these voxel indices post sort with std::upper_bound to get their groupings. I went from something like 40k unique indices down to 2 or 3. Let me reconstruct the code I used to test, it should be a simple thing to verify correctness/incorrectness for if there was a mistake on my end.

edit: this is what I used

std::vector<uint32_t> argSort32(int32_t *vals, size_t numVals)
{
    std::vector<uint32_t> idxs(numVals);
    std::iota(idxs.begin(), idxs.end(), 0);
    x86simdsort::keyvalue_qsort<uint32_t, int32_t>(idxs.data(), vals, numVals);
    return idxs;
}

Yeah, even with signed 32 bit types for the index I'm getting 4 unique voxel nodes instead of 47198 like there should be for this particular set of arguments.

Also flipping the keys and argument values around results 3 unique nodes, so, I don't think that's it, either.

If you can't reproduce this issue with random data I can gladly provide you this data in ascii since it's just indices, but it's going to be about 123 million integers.

your argsort implementation looks incorrect. You want to sort idxs based on the values in vals: therefore vals is the key and idxs is the value.

std::vector<uint32_t> argSort32(int32_t *vals, size_t numVals)
{
    std::vector<uint32_t> idxs(numVals);
    std::iota(idxs.begin(), idxs.end(), 0);
    x86simdsort::keyvalue_qsort(vals, idxs.data(), numVals);
    return idxs;
}

Couple of things to note: numVals needs to be < UINT32_MAX. Also, I am assuming vals is a copy of your data because the key-value sort will sort both vals and idxs arrays.

idxs should have numVals unique values (basically [0,1,2, ..., numVals-1] in a jumbled up order, else there is something seriously wrong with the code.

numVals == ~103 million, which is certainly smaller than the 2.5ish billion value of UINT32_MAX. Like I said before, I switched around vals and idxs in case I was confusing keys and values and still ended up with only 3 unique values. I haven't printed even a small slice of the output yet, I'm mostly just looking at what my std::upper_bound scan is concluding.

Did you want a gzip compressed text document of the vals array?

Also, I am assuming vals is a copy of your data because the key-value sort will sort both vals and idxs arrays.

In this circumstance it's ok to be destructive to vals as this is a temporary voxel node index calculation that doesn't survive past this. It's merely there to get row indices for a bunch of 3d points to arrange spatially in a grid.

Err wait...no that's not strictly true. OK, so I may need to make an explicit copy. Let me try that.

Ok, yeah that was it. Thanks for the tip. Here's the timing:
radix sort time: 3825.987119, kvSort time = 4555.936716

So radix sort is still winning there by a fair margin :(. Now, I did revert to the commit just prior to your PR to see what the performance was like with actual gathers (and GDS mitigations disabled, of course), and at least with the benchmarks it looks ~1.8x faster, so maybe it might have been beating IPP's index radix sort if we could rely on the microcode mitigations not being loaded.

Another thing going for this is that the reason it wasn't working is I was doing what the code was already doing, deriving the post sorted values. So, there's some savings there too but maybe not enough to beat radix index sort alone.

benchmarks it looks ~1.8x faster, so maybe it might have been beating IPP's index radix sort

Yeah, that is unfortunate. Hopefully future hardware won't need these mitigations and we can have a version of argsort with the gather instruction.

For there is scope of performance improvement for the [32-bit, 32-bit] key value sort, which we are working on. Will keep this open till then.

@KungFuJesus PR #120 improves 32-bit key value sorting by a fair bit. Do you mind timing your benchmarks with this patch to see if it improves anything at your end?

Sure, I'll check it out today and see the improvements. The mention of zmm registers I assume means everything is still AVX512 only - is this sort implementation possible on AVX2 era instructions or is that forever stuck on scalar implementation?

Hmm, I'm not seeing any significant improvements from this. I am using the key-value sort explicitly rather than argsort in my code, does that matter?

edit: Ok, I am seeing a marginal improvement for a much much larger set:

Before:

[INFO] VoxelGrid.cpp:93:VoxelGrid(): Voxel node sort completed in 53652.993568 ms

After:

[INFO] VoxelGrid.cpp:93:VoxelGrid(): Voxel node sort completed in 48680.894693 ms

Is this the expected gain or should we have seen something more significant?

is this sort implementation possible on AVX2 era instructions or is that forever stuck on scalar implementation?

#119 adds AVX2 versions but the performance gains won't match AVX-512.

Is this the expected gain or should we have seen something more significant?

On random data, I am seeing a 2x improvement in my benchmarking. What does your data distribution look like? Are you still benchmarking this piece of code where both the key and value vectors are 32-bit?

std::vector<uint32_t> argSort32(int32_t *vals, size_t numVals)
{
    std::vector<uint32_t> idxs(numVals);
    std::iota(idxs.begin(), idxs.end(), 0);
    x86simdsort::keyvalue_qsort(vals, idxs.data(), numVals);
    return idxs;
}

Also, what is the array size?

What does your data distribution look like?

It's 3d point cloud data that's likely to have a partial amount of order to it. The point locations themselves are used to compute an integer voxel node index by dividing by a uniform grid length on all 3 dimensions. I have a variant that produces this from the GPU and the data gets returned back as the rays intersect, with an atomic value on the gpu determining insertion index (making it slightly more random). The CPU version comes back in the order that the rays are cast, but the rays are formed in a 2d grid on a plane from some distance away, no guarantee for the data to be in voxel node order at all.

Are you still benchmarking this piece of code where both the key and value vectors are 32-bit?

Yes, I made sure of that, we're operating on unsigned 32 bit indices and the values themselves (the voxel nodes) are signed. Not sure if that matters at all, though.

Also, what is the array size?

This particular example is 891416312 nodes. Though I tried it on a significantly smaller dataset (41445847 node indices) so that it wasn't going to be strictly a memory bandwidth benchmark and it looked to be roughly the same time before and after (maybe 100ms worse).

Hmm not sure if I am measuring it wrong but I am seeing results differently. PR #121 compares copying the original array and sort indices using keyvalue_qsort against ippsSortRadixIndexAscend_32s and from what I am measuring keyvalue_qsort is faster than ippSortRadixIndex.

Benchmark                                                                                Time             CPU      Time Old      Time New       CPU Old       CPU New
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
[ippargsort.*random.*int32_t vs. simd_ordern_argsort.*random.*int32_t]                -0.5941         -0.5939          2657          1078          2660          1080
[ippargsort.*random.*int32_t vs. simd_ordern_argsort.*random.*int32_t]                -0.5208         -0.5195          3260          1562          3261          1567
[ippargsort.*random.*int32_t vs. simd_ordern_argsort.*random.*int32_t]                -0.4477         -0.4476          4506          2489          4510          2491
[ippargsort.*random.*int32_t vs. simd_ordern_argsort.*random.*int32_t]                -0.3853         -0.3865          7284          4478          7304          4481
[ippargsort.*random.*int32_t vs. simd_ordern_argsort.*random.*int32_t]                -0.3024         -0.3020         32020         22338         32037         22360
[ippargsort.*random.*int32_t vs. simd_ordern_argsort.*random.*int32_t]                -0.2371         -0.2370        901383        687703        901416        687765
[ippargsort.*random.*int32_t vs. simd_ordern_argsort.*random.*int32_t]                -0.6831         -0.6831      27155120       8604753      27153895       8604067
[ippargsort.*random.*int32_t vs. simd_ordern_argsort.*random.*int32_t]                -0.9221         -0.9221    1785294341     139015600    1785197181     139002825
[ippargsort.*random.*int32_t vs. simd_ordern_argsort.*random.*int32_t]                -0.9313         -0.9313   23847222073    1637784682   23845959455    1637654538
[ippargsort.*random.*int32_t vs. simd_ordern_argsort.*random.*int32_t]_pvalue          0.3413          0.3413      U Test, Repetitions: 9 vs 18
OVERALL_GEOMEAN                                                                       -0.6612         -0.6610             0             0             0             0

EDIT: these are measured for various array sizes: [128, 256, 512, 1024, 5k, 100k, 1million, 10m and 100m]

I can provide a dataset, if that would help. I'm fairly confident sharing it would be acceptable given that they are just arbitrary voxel indices.

sure. I can take a look.

So here's a smaller example. A quick note, this does happen to be faster overall because it applies the sorting to the keys, which is why I'm using it over IPP's radix sort while on AVX512. However, the IPP IndexRadixSortAscend had been faster than argsort by itself, which should be a more fair comparison. I've not tested the argsort function since then, but with this PR, the invocation of keyvalue_qsort is not any faster.

The data itself is a bunch of little endian 32 bit integers serialized to a file and then that is gzipped. So you should be able to use something like fread on little endian to get all of the integers (calculating the size by dividing the number of bytes by 4).
exampleInput.gz

This is from voxels coming out of the Embree code, so they are less scrambled than the GPU counterpart but not guaranteed to be in order.

Any progress on testing this data set?

apologies, I haven't had time yet. I will try to take a look at it this week.

HI @KungFuJesus yeah looks like for the data distribution you have, I am also measuring ippindexsort to be faster than avx-512 argsort. I wonder if the new pivot selection might help with it. Could you try benchmarking with #134?