scandum/wolfsort

Updating fluxsort/quadsort?

mlochbaum opened this issue · 53 comments

In my benchmarks for Robin Hood Sort, I used the wolfsort benchmark as a baseline, not realizing that the versions of quadsort and fluxsort here are now very out of date. So my chart has some misleading comparisons. And with current versions it's hard to build everything at once because wolfsort relies on the older fluxsort interface.

Would you mind updating this repository, or do you have advice on another way to run my benchmarks? If you won't have time in the next few days, let me know and I'll find some hack to get them updated. I can also probably PR code updates, I just can't redo the graphs.

Another comment, you can see on the graph at the bottom that wolfsort's cutoff of 1024 looks much too high. I think this comes from testing on the same data repeatedly: fluxsort gets an unrealistic advantage because (I assume) its branches on various subarray sizes are predicted much better. In my average benchmarks I advance the start position by 1 each time, which removes a dip in fluxsort under ~3e3 resulting in that gap at the cutoff.

And nice work on quadsort by the way!

Here's my line graph with updated quad/flux (wolfsort still using old versions), and the cutoff removed. ska_sort_copy has its own cutoff problem, which obviously I didn't touch.

[line graph: rand]

Nice work. I'll try to have the wolfsort benchmark fully updated in about 6 hours. I'll also take a look at the cut-off.

I've updated the source code. Couple of new things:

blitsort: updated from a rotate mergesort to a rotate quick/merge sort
crumsort: new in-place unstable branchless quicksort using a novel unstable partitioning scheme
fluxsort: updated to use faster pivot selection and benefits from quadsort improvements to small array sorting
gridsort: benefits from quadsort changes and does well on very large arrays
quadsort: updated to use branchless parity merges
wolfsort: now uses fluxsort as the fallback and quadsort for small array sorting

I'll put up some graphs in a little bit.

Thanks, I've updated my benchmarks to match. Will be investigating crumsort, although it looks like it's not different enough from flux to add yet another bar to the RH chart (and the stable sort makes more sense for comparison anyway). Exciting though! I've been thinking I can make LSD radix sort in-place (well, constant 1KB or something) as a base case for ranges under 2^16, by collapsing the elements to 2-byte, radix sorting back and forth in the freed half, and expanding again. I'll probably end up using crumsort as the outer part for that experiment.

As far as I can tell crumsort pulls ahead of fluxsort at 1M+ elements, undoubtedly due to being in-place.

I'll definitely be taking a closer look at RH, the performance looks spectacular. Wolfsort seems to be running about 33% faster on my machine than yours for 100K elements.

Using the trie approach on a radix sort definitely sounds interesting.

That makes sense about crumsort on larger arrays. So a crumsort → fluxsort pathway to cut down memory use to a small fraction of the input array would be sensible. And it looks like fulcrum partition doesn't obliterate patterns in the way pdqsort does, is that right?

I think my biggest conclusion from working with RH was actually that radix sort is a lot better than I realized (most implementations out there are awful). The adaptiveness of RH is nice, but that comes with a lot of inconsistency. I've just tested what I think is the worst case of a single huge element, and small random numbers otherwise. It degrades to 70-90ns/v on medium-length arrays. Although all but 30ns or so is from my terrible merge, and using quadsort (without even skipping up to the block size) brings it to more like 45-65ns. I think I can do better by trying to skip blocks within merges as well.

Radix sort is actually better than it looks on the bar chart in my opinion. The bad cases are from cache associativity as I analyzed here, and completely disappear for small arrays. And a 2-byte sort is more than twice as fast.

Counting sort is way faster than either: that's what RH uses on the Random%100 test. Admittedly a best case due to the very small range, but with the right implementation (tricky) it stays near that performance even for ranges a little above the length. This is what I did when implementing Dyalog APL's sorting algorithm, although I think I can improve it with a threshold strategy slightly similar to RH: only keep the count mod 16, appending the value to an overflow list whenever that rolls over. Then sort the overflow list, and write the mod 16 part branchlessly, and any overflow using branching.
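
To make that threshold idea concrete, here's a rough sketch (an illustration only, not the Dyalog or rhsort code; aux_sort_u32 is an assumed helper for the overflow list):

#include <stdint.h>
#include <stdlib.h>

void aux_sort_u32(uint32_t *array, size_t nmemb); /* assumed helper: any fast sort for the short overflow list */

/* Sketch only: sorts nmemb values in [min, min + range). */
void counting_sort_mod16(uint32_t *array, size_t nmemb, uint32_t min, uint32_t range)
{
        uint8_t *count = calloc(range, 1);                     /* count mod 16 per value          */
        uint32_t *over = malloc((nmemb / 16 + 1) * sizeof(uint32_t)); /* at most one entry per 16 copies */
        uint32_t *dst = array;
        uint32_t v;
        uint8_t c;
        size_t n_over = 0, i, j = 0;
        int k;

        for (i = 0 ; i < nmemb ; i++)
        {
                v = array[i] - min;
                count[v] = (count[v] + 1) & 15;

                if (count[v] == 0) over[n_over++] = v;         /* 16 occurrences rolled over */
        }
        aux_sort_u32(over, n_over);                            /* the overflow list should be short */

        for (v = 0 ; v < range ; v++)
        {
                /* mod-16 part; the real trick would write up to 15 copies
                   unconditionally and advance dst by count[v] (branchless) */
                for (c = count[v] ; c ; c--) *dst++ = v + min;

                /* overflow part: 16 copies per wrap, rare, so branching is fine */
                for ( ; j < n_over && over[j] == v ; j++)
                {
                        for (k = 0 ; k < 16 ; k++) *dst++ = v + min;
                }
        }
        free(over);
        free(count);
}

The point is that the counts stay tiny (one byte per value), the common path can be written branchlessly, and the rare overflow path is the only branchy one.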

I haven't checked for pattern obliteration, I think fulcrum will leave large sorted ranges mostly intact.

As for memory use, blitsort is very good when provided with 4 * sqrt(n) of aux. It's slightly faster than flux at 10M elements with 32768 aux.

I'm very impressed with RH, admittedly Wolfsort is mostly a hack job, an experiment with a radix / comparison hybrid with the radix part being very primitive. I may go back to the drawing board sometime and see if I can come up with something better.

I've been thinking about using fluxsort to partition down to 10K elements to deal with 1M+ sized arrays. So far I've had little luck getting counting sort to run fast on 64 bit integers.

I checked out your link and have some suggestions.

I did quite a bit of research on binary searching. I developed the monobound binary search which is up to 4 times faster than the standard binary search you discuss. I also wrote a monobound_bsearch() routine for complex data. I did some work on interpolation and adaptive binary searches as well.

I'm not sure if binary searches can be considered branchless like you seem to indicate.

Tree sorts seem to be missing, possibly because you don't consider them performant, but gridsort is showing some promise. I personally tend to make a distinction between bottom-up and top-down sorts, and I categorize tree sorts as a bottom-up quick / partition sort. As gridsort shows, adaptive partitioning is possible from the bottom up, though performance is a challenge.

Rotations are interesting as a form of in-place partitioning. I did some work on that as well, with rotations becoming relevant once you make them fast enough.

All in all a pretty useful article.

Did you read the linked pvk article? As far as I can tell monobound_binary_search in binary_search.c performs the same arithmetic (e.g. &base[half] corresponds to bot + mid). It's branchless if the conditional assignment of base/bot is performed with a cmov, and it's usually not hard to arrange things so the compiler does that. There's a branch for the loop of course, but it's perfectly predictable as the length is constant.
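
For reference, the shape being discussed looks roughly like this (a paraphrase, not the actual code in binary_search.c):

/* Returns the index of key in a sorted array, or -1 if it is absent. */
int monobound_sketch(const int *array, unsigned int size, int key)
{
        unsigned int bot = 0, top = size, mid;

        while (top > 1)
        {
                mid = top / 2;

                if (key >= array[bot + mid]) bot += mid; /* conditional base advance: a cmov, not a data-dependent branch */

                top -= mid;
        }
        return (size && key == array[bot]) ? (int) bot : -1;
}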

I had not read the pvk article, the code is indeed functionally identical.

Added proper quadsort integration (description), and I get the following worst case, fairly reasonable.

[graph: badq, worst case]

Also figured out that it's trivial to use RH as the first pass of a merge sort, which with quad merging is faster than fluxsort on random numbers up to 1e7. Fluxsort as the outer part should be better of course.

[graph: randq]

I did the packed 2-byte radix thing (or trie strategy), and put the code in this repository. Performance is very good: around 6ns/v for ranges below 2^16 and lengths from ~100 to 2^16, plus the standard <1ns/v for each partition to get there. With only 2KB for the counts. Competitive with ska_sort_copy on large arrays for ranges below about 2^18. Lots of other distribution sorts to be added if swap space is available, but that was the only in-place one that seemed promising.
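
Roughly, the scheme looks like this (a reconstruction for illustration, not the code in the repository; min is assumed to come from an earlier analysis pass):

#include <stdint.h>
#include <stddef.h>

/* Sketch only: assumes nmemb <= 1 << 16 and max - min < 1 << 16, so the packed
   values fit in the low half of the buffer and the counts fit in 2KB. The
   uint16_t view of the uint32_t buffer glosses over strict aliasing. */
void radix16_packed(uint32_t *array, size_t nmemb, uint32_t min)
{
        uint16_t *lo = (uint16_t *) array;  /* first nmemb * 2 bytes of the buffer */
        uint16_t *hi = lo + nmemb;          /* freed nmemb * 2 bytes used as swap  */
        uint32_t count0[256] = {0};
        uint32_t count1[256] = {0};
        uint32_t sum0 = 0, sum1 = 0, c;
        size_t i;

        for (i = 0 ; i < nmemb ; i++)       /* pack and histogram; forward writes never clobber unread input */
        {
                lo[i] = (uint16_t) (array[i] - min);
                count0[lo[i] & 255]++;
                count1[lo[i] >> 8]++;
        }

        for (i = 0 ; i < 256 ; i++)         /* exclusive prefix sums for both digits */
        {
                c = count0[i]; count0[i] = sum0; sum0 += c;
                c = count1[i]; count1[i] = sum1; sum1 += c;
        }

        for (i = 0 ; i < nmemb ; i++) hi[count0[lo[i] & 255]++] = lo[i]; /* pass 1: low byte, lo -> hi  */

        for (i = 0 ; i < nmemb ; i++) lo[count1[hi[i] >> 8]++] = hi[i];  /* pass 2: high byte, hi -> lo */

        for (i = nmemb ; i-- > 0 ; ) array[i] = (uint32_t) lo[i] + min;  /* expand backwards, in place  */
}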

The worst case looks manageable. I've been labeling sorts as adaptive if random or bit-reversal is the worst case, and as semi-adaptive if there are exceptions.

As for tail_merge not working out: it's not so great for small ranges, and quadsort should be fairly optimal if you have a bucket size of 32 elements.

I wrote up my own little blurb about the optimal range for radix sort:

https://github.com/scandum/wolfsort#setting-the-bucket-size

I should probably have written L2 instead of L1 cache, though I assume it's actually both that play a role, and might explain the square root access time you make mention of? In all honesty, I'm just making guesses since I don't have a full understanding of the subject. Fluxsort seems pretty much cache oblivious in your graph.

Counting sort definitely has my attention now. I assume 64 bit processors have turned it from a relatively useless 1-2 byte sort into a much more relevant 4 byte sort?

I have some code written to get the min and max in the analyzer with fairly minimal performance loss. I might do some tinkering with crumsort to improve generic data detection. I do have working run detection in my local fluxsort that significantly improves ascending tiles and might have a small benefit for generic as well.

size_t FUNC(flow_analyze)(VAR *array, size_t nmemb, VAR *min, VAR *max, CMPFUNC *cmp)
{
        char loop, dist;
        size_t cnt, balance = 0, streaks = 0;
        VAR *pta, *ptb, swap;

        pta = array;

        *min = *max = pta[nmemb - 1];

        // walk the array in blocks of 16, tracking min/max and adjacent ordering
        for (cnt = nmemb ; cnt > 16 ; cnt -= 16)
        {
                for (dist = 0, loop = 16 ; loop ; loop--)
                {
                        if (cmp(min, pta) > 0) *min = *pta; else if (cmp(pta, max) > 0) *max = *pta;

                        dist += cmp(pta, pta + 1) > 0; pta++; // count descending adjacent pairs in the block
                }
                streaks += (dist == 0) | (dist == 16); // block was fully ascending or fully descending
                balance += dist;
        }

        // remaining (up to 16) elements
        while (--cnt)
        {
                if (cmp(min, pta) > 0) *min = *pta; else if (cmp(pta, max) > 0) *max = *pta;

                balance += cmp(pta, pta + 1) > 0;
                pta++;
        }

        if (balance == 0) // no descending pairs: the array is already sorted
        {
                return 1;
        }

        // ... (the rest of the routine was omitted from the original comment)

My understanding of counting sort was a bit off. Making it work for large ranges will be an interesting challenge.

I have updated the crumsort code here: https://github.com/scandum/crumsort with prior pivot improvements. It doesn't use the analyzer to determine the minimum and maximum, but it is able to start checking min/max midway through partitioning.

I'm not entirely sure about the math, but I think this will give crumsort access to the prior pivot for log(n)² − log(n) partitions instead of log(n)² / 2. Min/max should be available for log(n)² − 2·log(n).

I had to lower the quadsort threshold from 1/16th to 1/32nd because removing the pivot from the distribution appears to increase the odds of a worst case. I'm not sure why. It could be a weakness of pseudomedian selection, I might have to give median of 8/16/32 a try.

Catching up on older comments: trinity rotate is great for sure. I think it should also vectorize straightforwardly, as the movement's the same just with some vector registers reversed (handling the edges will be a pain as always). Compilers don't seem to know how to do this. I see vector code for the in-order *--ptc = *--ptd; *ptd = *--ptb; part but not the reversals. Regarding gridsort, I don't like the idea that initial elements of the array determine how later ones are bucketed (assuming I understand this right), because it seems natural for that distribution to be very different from one part of the array to the next. I don't see why this should have any advantage over a samplesort that partitions with binary search, unless there's a need for online sorting. But I do think samplesort is worth investigating, which is why I tend to reference IPS⁴o. Everything they write is so hard to read though.

Counting sort, and the 2-byte radix sort strategy, are only applicable to "small" ranges, yes. I think in practice a large array to be sorted is overwhelmingly likely to fit that criterion. For counting sort, the range can be up to 4 times larger than the length with good (i.e. better than RH) performance, and the large memory use required for those ranges can be cut by partitioning. For 2-byte radix sort, you have to partition, but can do ~2^7 elements with a range of 2^16, so a ratio of 2^9. Any array larger than 2^23 ~= 8e6 automatically fits that range, and I think smaller arrays usually will too: most 32-bit data is not really 32-bit, since 32 bits was probably chosen as an upper bound on the size of the data. For 8- and 16-bit types I think counting/radix sort should be used almost exclusively (maybe unless an initial scan shows the whole array is just one or a few runs); for 64 bits I think they should be used when possible but they're more niche.

I didn't notice crum_median_of_sqrt was swapping within the array instead of using a buffer (obvious in retrospect), that makes pivot tracking much easier. Also means you could partition them trivially, and add new candidates with a smaller sort+merge, something I'll have to think about. But if you know just the max/min values, you can also pass those around as arguments and avoid pivot shuffling. Another quirk of pivoting is that in the interior, max and min aren't the same: since values equal to the pivot get placed on the left, the min is a strict lower bound but the max isn't (I could have this backwards). I don't take advantage of this in distribution crumsort because the initial minimum isn't strict, and the only way to make it strict would be to move all the minimum elements, or something weird like temporarily reducing the value by 1.

The intermediate results from blockwise min/max could be useful for analysis. For example, if adjacent blocks have little overlap, that indicates that large merges will probably go well, so a good strategy would be to sort roughly block-sized arrays by whatever method and then merge them. At this point I have a strong feeling that our initial pass is much less advanced than the later sorting algorithms and is holding us back. I'll keep thinking and probably write some more notes about this part.

Oh, you can treat the whole sqrt-sized candidate list as one big pivot. After partitioning the rest of the array, swap the second half of it (no need for a search, you know the location of the actual pivot) with elements from the left partition. Then in each half, find additional candidates, sort, and merge with the existing ones. For a perfect partition, c·√(2n) candidates from 2n elements are reduced to c·√(2n)/2 = c·√(n/2) existing candidates per half of n elements, and this needs to be increased to c·√n, which means √(1/2) ≈ 70% of the required candidates are already there, reducing the number that need to be acquired by about 70%. I think that means it's worth starting with a candidate list about 3 times larger. A minor annoyance is that the number of existing candidates needs to be passed into recursive calls.

I observed the compiler wasn't creating optimal code for some parts of trinity, might be something that needs to get fixed by the compiler folks.

I did experiment with a sample sort, but gridsort's adaptive binary search works best with the current scheme. I might have to give it another go, but the performance gain seemed minimal the last time I tried. Right now gridsort is surprisingly optimal with comparison counts on par with mergesort.

Taking better advantage of crumsort's prior pivot selection might be worth looking into. An interesting aspect of crumsort is that the first and last 16 elements are partitioned stable, so a median of 32 scheme might be cost effective.

@mlochbaum

It should all be updated. I changed the setup of several distributions as the pipe organ wasn't a proper natural run, which was giving branched sorts an advantage. Somewhat similar to branched behavior on bit reversal distributions.

I added drop support and min/max finding to wolfsort; rhsort is still quite a bit faster on random, but I figured I'd focus on different areas.

I'll probably update the readme tomorrow.

Thanks! I've updated the benchmarking code accordingly. Will re-measure everything once I don't have so much stuff running on this computer. pdqsort really doesn't like those saws. I think I'll have to crop the bar chart.

SKIP_DOUBLES and SKIP_LONGS are really helpful too.

I've updated the readme, it now has a benchmark up for rhsort vs wolfsort vs skasort.

For some reason wolfsort is quite slow on random on your benchmark, barely faster than fluxsort. I compile with gcc -O3. I am benching on WSL, so it's a bit atypical platform-wise.

I categorized it as a hybridization of dropsort aka stalinsort. I'm not familiar with RHH, did it introduce dropping first?

I also did a range benchmark, not looking great.. oh well. I totally forgot about distcrum, I'll look into that later.

Edit: I do assume you're not calling wolfsort with a comparison function? That would slow down performance quite a bit. For my benchmarks I set the cmp macro from bench.c.

Not sure what's happening with wolfsort. It's using a div instruction in a hot loop in wolf_partition? That could vary across processors. Also gcc -O3 here, version 12.2.1. I just tested clang and it's maybe 5% slower. I did confirm I'm defining the cmp macro, and if I remove it the wolfsort benchmark slows down a lot and the RH one won't even compile.

Something I just found out about is drop-merge sort, which gets down to 2-3ns/element for mostly-sorted arrays. Seems like a powered-up version of pdqsort's optimistic insertion sort. I translated it for SingeliSort here (hoping it's readable without knowing Singeli).

I saw you describe Robin Hood as a dropsort and have been thinking about it. The dropping is important but not the main mechanism, so I'd consider the closest relative to be bucket sort (I guess you could say radix, but to me the defining feature of radix sort is having multiple passes so that doesn't sit right). It does have a lot of similarities to drop-merge even though I wasn't aware of that algorithm at the time.

Robin Hood hashing doesn't traditionally drop elements; that was something I came up with at Dyalog in ~2018 as described in the history section. I doubt I'm the first but I haven't seen it anywhere else. What I have seen is a version that resizes when the maximum offset gets too large instead of based on total number of elements.

Thanks, I'll look into that div. My latest version added wolfsort_prim(array, nmemb, sizeof(int) + 1) which sets the VAR to unsigned integers, not sure if that makes a difference. The default is signed integers.

Drop-merge sort looks interesting, though it seems like a bit of a one-trick pony? Hard to judge how relevant it would be in the real world. Singeli is hard to read. Is it intended to be a bit esoteric? It does look interesting.

I've worked on a scripting engine of my own, tintin++, it offers a scriptable alternative to ncurses and might be the most capable shell scripting tool out there.

Good point about radix, though you can do a bucket sort with binary searches as well. I will fix up the readme.

In my opinion rhsort is the only use of dropping I've seen that makes sense.

[benchmark graph]

Thanks for figuring out how to bench against Rust by the way. Those are some interesting results and show where there might still be some meat left on the bone for flux to improve. As you can see, I dusted off your exponential benchmark test.

I've tried to figure out how to call glide with a comparison function for c strings/doubles/etc, but I'm stumped. Do you by chance know how to pull that off?

My focus has primarily been on slightly heavier comparisons, since the real world utility of sorting arrays of primitives seems limited.

Drop-merge is definitely a one-trick pony. My thinking is that if all your pivot candidates are in order, or only a few are out of place, then you'd try it out. I don't think a random sample is ever enough to rule out bad cases: a list like 1 0 3 2 5 4... will have half the elements dropped, but the chances of a √n sample hitting one of the inversions is about 40%. The overhead on a bad case is small to start with (<10ns/elt I think) and it's easy enough to abort if the drop list gets too long.

Passing in a comparison function to Rust from C sounds very hard; I'd think you would have to write a Rust copy. I don't know anything about how that works. My thinking has been that when branchless comparison isn't possible you may as well write an entirely different algorithm, and that quicksorts don't seem to offer much. But Orson makes the point that sorting an array with a small number of unique strings does come up in practice, and quicksorts do look faster than timsort on his string benchmarks. I didn't see any string benchmarks in your READMEs. Are there any?

Singeli is hard to read. Is it intended to be a bit esoteric? It does look interesting.

That's a little disappointing. Singeli itself is supposed to be flexible, for example all operators are user-specified (include 'skin/c' is built-in to get C-like ones). But with SingeliSort I'm trying to make it look enough like something mainstream like Go to be readable without a lot of effort. The most common departures are reading pointer -> index and writing pointer <-{index} value, and the cast operators ~~ for same-size cast, <~ for narrow, and ^~ for promote. Is there other stuff that's confusing? Considering a few options:

  • Change some operators and names, if you have suggestions. pointer[index] doesn't work though. It has to be infix.
  • Add a Singeli primer to the README that explains enough for someone to get started. Maybe comments in the source to explain weirder features.
  • Give up, write it however I can get it down the fastest, and maybe translate to C/C++ later. C macros and C++ templates really aren't up to the task, might involve a lot of copy-paste.

The main problem I've observed with pivot oversampling using a lossy algorithm is cache pollution.

In my readme, below the bar graph against qsort(), I provide a benchmark table (collapsed by default), which contains a bench for random strings. Quadsort and fluxsort are roughly 1.6x faster than qsort sorting 100K 16 byte strings. Once L3 cache runs out things get dicey though.

To me at least, reading other people's code is very challenging. Add even a slight change to the programming language and most programmers will go tl;dr. Fortunately, some people really enjoy trying new things.

If I was in your position, I would consider "how can I convince people that learning this is worth it". Maybe a YouTube video that walks people through a tutorial. If you'd implement piposort in Singeli, or something I can relate to, it'd be easier for me to judge the added value.

I've been kind of fortunate that I have an existing user base for my utility. I've given up on projects because I didn't see how it could ever gain traction. I created base252 and VTON on a whim, they get a few stars a year from amused readers, which in turn amuses me due to a minimum investment.

So I think I've narrowed the merge performance difference between flux and glide down to this:

// ptd = destination, ptl = left, ptr = right
        x = cmp(ptl, ptr) <= 0; // 1 if the left element goes first
        *ptd = *ptl;            // write the left element unconditionally
        ptl += x;               // advance left only if it was taken
        ptd[x] = *ptr;          // right element lands at ptd (overwriting) or speculatively at ptd + 1
        ptr += !x;              // advance right only if it was taken
        ptd++;                  // any speculative write at ptd + 1 is overwritten next iteration

That's my branchless merge. It's a total hack, but it works. It is quite inefficient. Do you have any ideas how to improve that in a way that gcc compiles as branchless?

I'd try saving the value of *ptr, since if the compiler doesn't know the ptd write won't change the value then it has to read it again.

You can avoid the !x by keeping the sum of xs in a variable, and unconditionally incrementing ptr and subtracting the sum. No idea if this is better.

        rv = ptr[-l];             // the actual right element: ptr advances every step, offset back by the left count l
        x = cmp(ptl+l, &rv) <= 0;
        *ptd = ptl[l];            // write the left element unconditionally
        l += x;
        ptd[x] = rv;              // right element at ptd or ptd + 1, as before
        ptd++;
        ptr++;                    // unconditional increment; the -l offset corrects for it

(Had some uses of an index d left in from an earlier version that don't make sense any more; just edited them out)

Thanks, not sure if that approach will work, but at least it got me thinking in new directions.

[benchmark graph]

I threw in the towel and just went the clang route, which compiles a ternary comparison as branchless out of the box. That's quite the contentious thing for clang to do since gcc can't easily follow suit, but it gets the job done!

Still baffling to me that C compilers haven't defined some way to force a cmov. clang has some problems too: make sure you test on floats too as it can compile those differently.

Thanks for the tip, 32 bit floats specifically?

I'd say ternary is perfect for forcing branchless, but I can imagine clang notably degrades some legacy software.

I've seen it with doubles; haven't tested 32-bit floats but I expect they'd be similar.

Looked it up: this issue reports a problem with ternaries generally, which -mllvm --x86-cmov-converter=0 fixes. The issue I had was that my workaround of using arithmetic on a comparison result doesn't compile branchlessly if it's a double comparison.

So it's somewhat surprising that clang works well here. Have you checked the emitted code to confirm that the merges are branchless and it's not clang speeding something else up?

I double-checked, I'm not doing anything complicated however.

*ptd++ = cmp(ptl, ptr) <= 0 ? *ptl++ : *ptr++;

Binary searching is pretty annoying with clang. My monobound search is one of the slowest, while my double_tapped search is the fastest. Guess I'll be needing an annoying number of #ifdef __clang__ checks.

The performance on my rotations is a mess as well. The overall speed of clang seems poor.

I have SingeliSort in a reasonable shape for benchmarking, and added compiled/sort.c so you can test as well. In that file glide32 has the glidesort outer layer to adapt to runs and flux32 doesn't and is slightly faster if there aren't natural runs. Doesn't compile with C++ sadly. Looks competitive on all Wolfsort's tests. And fairly close to RH on random input:

[line graph: rand]

It doesn't have adaptive merges yet, so it'll probably fall behind quadsort on some things. I made some changes to the partitioning control code here. The thing with most should isolate a pivot with many repetitions faster, and avoids the minor issue that a pivot repeated many times might cause an unbalanced partition even in a case that's good for quicksort. There's also a significant gain from switching between crumsort at large sizes and fluxsort at small ones.

I think I've seen both you and Orson claim that parity merge requires the sides to have equal lengths, but this doesn't seem to be the case? The reason pointers won't collide in the middle is that both sides are sorted, nothing to do with lengths. In fact you could even do a different number of merges on each side if my reasoning's right (this would be pointless). Balanced parity merge might even do an unbalanced merge as part of normal operation, for example if the initial two moves both take from the left side then the rest of the loop does a parity merge on n-2 values from the left and n from the right. Am I missing something here, or did I misinterpret?

I think I found a way to improve drop-merge sort's performance so that bad cases are detectable from a √n sample, and also make it stable. The situation with 1 0 3 2 5 4... isn't nearly as much of an issue as I thought. Provided you sort the dropped values recursively with drop-merge, you still get O(n) performance if the array has O(n) out-of-place pairs (out of n*(n-1)/2 total). The real issue is that if many pairs come from a single very out-of-place element, a sample likely won't detect it, so the dropsort has to handle these cases effectively. If my strategy for doing this works out, I'll make a separate C repository for it like RH sort.

Oh, the arguments are glide32(array, nmemb, swap, swap_bytes). In bench.c I pass in (n + 4*(n<1<<16 ? n : 1<<16))*sizeof(i32) bytes, so 5*n elements for small lengths and n + (1<<18) for large ones. It probably breaks if there aren't n plus a few elements of space, and less than 5*n for smaller sizes would limit the amount of work Robin Hood can do.
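
So a call looks roughly like this (parameter types assumed):

#include <stdint.h>
#include <stdlib.h>

void glide32(int32_t *array, size_t nmemb, int32_t *swap, size_t swap_bytes); /* signature types assumed */

void sort_with_glide32(int32_t *array, size_t n)
{
        size_t swap_bytes = (n + 4 * (n < 1 << 16 ? n : 1 << 16)) * sizeof(int32_t);
        int32_t *swap = malloc(swap_bytes);

        glide32(array, n, swap, swap_bytes);
        free(swap);
}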

I'll try to take a look in a few days. Crumsort should already switch from fulcrum to flux partitioning if the partition fits in memory? It might even be worth it to perform some moves or rotations (like blitsort) to stretch the memory a bit.

SingeliSort is looking good. Is this pure comparison? Very impressive if it is.

Regarding parity merges, it is optimal for random when both sides are equal length, but this is not a strict requirement. I intended it more as an example, as in, imagine two equal length arrays, but I might have to reword that. There are a variety of methods to perform a bi-directional merge. There should indeed be no need for boundary checks with unequal length arrays, but there's some advantage to finishing a merge with a copy to minimize comparisons, instead of going all the way.

But the parity principle indeed appears to boil down to bi-directional characteristics of two sorted arrays regardless of length. I'm not exactly sure what Orson was getting at, I got the impression he was trying to give me as little credit as possible, and was subsequently coming up with convoluted arguments. Very different from the powersort article that mentions Timsort 60 times. Working on sorts has been fun, but I found this aspect of it an upsetting experience.

I think the original dropsort author, or someone else, mentioned that there might be a bias towards reverse order in the dropped elements. That might be something that could be taken advantage of.

[benchmark graph: clang, 1 << 17 elements]

Having something to compare against was pretty useful. This is compiled with clang, 1 << 17 elements. I fixed a bunch of minor adaptivity issues, and quadsort is now faster than fluxsort for pure random data.

For larger arrays malloc() becomes a serious performance issue, taking up as much as 75% of the total sorting time in some of the cases. So even saving n / 2 memory might be worth it, but since this is an OS issue it will hopefully get fixed eventually?

The 32-bit methods use lots of distribution sorting, leaning heavily on Robin Hood. For testing comparison-only, I included some for doubles, glide_f64 and flux_f64, since none of the distribution methods support floats currently. RH could probably be adapted but my plan is to convert to 64-bit ints (xor the sign bit with the rest of the float) since this also fixes problems with NaNs.
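
One common formulation of that kind of mapping, for reference (an illustration; the version I end up with may differ in detail):

#include <stdint.h>
#include <string.h>

/* Map a double to a 64-bit key whose unsigned order matches the float order:
   negative values get all bits flipped, others just the sign bit, so NaNs end
   up in a consistent spot at the extremes instead of breaking comparisons. */
uint64_t f64_key(double value)
{
        uint64_t u, mask;

        memcpy(&u, &value, sizeof(u)); /* bit copy avoids aliasing issues */

        mask = ((uint64_t) -(int64_t) (u >> 63)) | ((uint64_t) 1 << 63);

        return u ^ mask;
}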

[benchmark graph]

It's pretty depressing to bench against rhsort. :) You may want to look into my twice-unguarded insertion, it's pretty epic. I could see that work quite well with rhsort.

Need to work on my small lengths I see. Under 128 elements it will always end up using piposort. But the glidesort layer doesn't special-case this, adding ~10ns, and alloc/free adds another ~15ns in my tests. I don't think I actually want to put glidesort outermost because it means if there are some runs then merging will be chosen over faster options like counting sort. In fact it's just hit me that a single run in the argument breaks the rest of it down from one length-n sort to ~log(n) of them, 128, 256, 512... n/2. Maybe fine when your merges are about as fast as partitions, but not good for distribution sorting. So yeah, it really needs to be gated behind some kind of analysis pass.

Timsort has some very obvious strengths, but it's hard to say whether a timsort layer is worth it on real-world data.

[benchmark graph: small array sizes]

Looks like piposort does okay on 33-64 with clang's ternary branchless merges, not sure if rhsort would have additional gains with the same merges.

Bottom-up analyzers are indeed weak to unfavorable runs, which is why I went with the top-down approach for fluxsort. Dealing with random % 2 is tricky as well; you pretty much need to set your minimum run length to 32-64 elements.

This then opens up the door to a large performance drop when comparisons are expensive and there are a ton of tiny runs.

I dug into merging some. I was totally wrong when I said parity merge works with any length (and I meant plain parity_merge with no extra checks; not sure I got that across). The problem is that if the shorter side is entirely contained in half of the result, then one of the pointers will run off the end and start reading out-of-bounds, giving wrong results. The one-less merge in piposort does seem to work, but for subtle reasons: the forward-facing left pointer can actually run into the right side, but only on the last step, and it'll be equal to the forward-facing right pointer. The backward-facing left pointer can't run off because there are only n/2 merges in that direction.

I think trimming the ends of the longer side with binary searches is enough to fix this problem if unbalanced merges are wanted. Then the shorter side is effectively guarded, and the longer side still has the trimmed sections as a buffer.

What glidesort does is to take the minimum length, do that many merges, and repeat until done. Which seems disastrous if that minimum is small for a long time, say if one side has a single value in the middle. The interleaved-merge scheme is complicated enough that I'm not sure how to demonstrate this. Glidesort is definitely very bad at unbalanced merges, like an array of two copies of 0, 1,... n-1, -1, n. Having spent some time with glidesort now I definitely get the impression that it's built from only a vague idea of what quadsort and fluxsort do instead of working with the code. It's just that this isn't a good thing.

In my merging code I found that sharing an index between the destination and one of the source pointers is faster than incrementing both of them. Singeli compiles the check to roughly if (!c) goto label; r=l; label:, but gcc and clang both produce branchless code for it. Except gcc's version is way slower for some reason I can't figure out. I also found that unrolling by 2 speeds up the merge a fair amount; glidesort also does this.

It's indeed a bit finicky when it comes to whether it guards or not.

I updated quadsort today and settled on calling the parity merge variant I developed for variable length merges a cross merge. I've tried to explain it in the readme. https://github.com/scandum/quadsort#cross-merge

Basically it looks 8 elements ahead for runs (4 comparisons), and if there are no runs it performs 8 parity merges (16 comparisons). The actual code in quadsort doesn't work with variable length arrays because it tries to default to a pure parity merge if the two arrays appear to have even distributions, without checking if the arrays are of equal length, but with a length check in place it'll work.
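
A rough sketch of the head-side check (simplified from the description above; the actual code also checks from the tail and guards the boundaries, with VAR and cmp being the element type and comparison macros):

        if (cmp(ptl + 7, ptr) <= 0)         /* the next 8 left elements all come first */
        {
                memcpy(ptd, ptl, 8 * sizeof(VAR)); ptd += 8; ptl += 8;
        }
        else if (cmp(ptr + 7, ptl) < 0)     /* the next 8 right elements all come first */
        {
                memcpy(ptd, ptr, 8 * sizeof(VAR)); ptd += 8; ptr += 8;
        }
        else
        {
                /* no run: perform 8 branchless parity merge steps instead */
        }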

I've briefly glanced over the glidesort code and got a similar impression. The whole bidirectional partition thing seems a bit odd to me, not worth going through that much trouble for 5% extra performance. It's less trouble to interleave two parity merges, and it does boost performance with branchless ternary merges in clang, but I figured quadsort has too many lines of code as is, and it'd make more sense for a radix sort to go that route.

Glidesort is pretty good at variable length runs, but that might be the only thing it has going for it relative to fluxsort.

I'll take a closer look at your merge code, looks pretty clean. The whole memory-level parallelism aspect is pretty fuzzy. Looks like you are performing two head merges followed by two tail merges?

I've observed that creating a memory offset can increase performance as well, so both interleaved merges don't have to access a new memory block in the same loop. But stuff that works in gcc doesn't seem to work well in clang, and vice versa.

Then there is the issue that I need it to be fast when sorting tables. So I've discarded quite a few things that were fast on arrays of primitives, but didn't do well on tables.

Edit: clever usage of +i and -i, will have to give that a try!

I got the basic idea of skipping 8 elements when it was called galloping_merge, but never worked through it (was thinking it might be better to downgrade it to "canter", since to me galloping should use powers of 2).

The @for_unroll in parity_merge will do head-tail-head-tail (defined here). If you're looking at parity_merge_const, it's just for 8 or 16-element sorts and I believe I just took the pattern from quadsort.

Agreed about bidirectional partition. It even messes with the pivot selection logic.

I noticed a bit late that glidesort stopped recognizing runs under sqrt(n / 2) in this commit. So unordered data is still chopped into 48-element segments, but natural runs may be ignored (in a way that looks quadratic; submitted an issue about that one). That's one way to do it. I've been considering another strategy of allowing logical runs to be half sorted and half unsorted, so that glidesort always sees four lowest-level runs before deciding what to do. I would definitely throw out one sorted run in the middle, and if it's at one side, that just turns into a bigger sorted+unsorted run. But if there are two runs it's not so obvious any more.

My small-array sorts take a source and destination pointer (this is almost always free, with elements moved to the destination in the first round of exchanges or during insertion), and I realized this gives me better control over how big the piposort base cases are. I can split in four as usual, or in two with the small sorts going from array to swap. Except... this doesn't look like it does anything for performance because what matters is how close the length is to a power of two and not which power of two. That sound right to you?

Not sure if you noticed that quadsort is faster than both glidesort and fluxsort when compiled with clang? I've yet to get piposort to run at the same pace on clang, but it should be possible, and theoretically it can go faster than quadsort on random.

Then the big question becomes how fast we can make a mergesort run on low cardinality data, and if we can do away with quicksort altogether. It might also be feasible to switch to a bucket sort utilizing binary searches using quadsort's low cardinality detection scheme.

Regarding glidesort, it might be beneficial to partition 3 random runs with quicksort, and partition 1 natural run with a binary search, using the same pivot for all 4. I find the term logical merge confusing when it's just performing 'random run' detection. It's little different from quadsort's reverse run detection.

As for piposort and powers of two. Quadsort is optimal for arrays of 128, 512, 2048, etc elements. In the case of piposort this weakness is obscured, but it's probably still there. In the case of glidesort the weakness might be obscured even further because it becomes dependent on the data (and timsort psychology might kick in where it becomes easy to ignore performance loss on random data due to extremely good performance on long natural (unnatural?) runs).

The small sort in piposort becomes quite important. If you're willing to ignore comparison counts, oddeven (bubble) sorting for 1 to 8 elements is very fast, but that'll be an issue when you start comparing strings.

        size_t y, z = 0; // z counts left elements taken so far
        while (--left)
        {
                x = cmp(ptl+z, ptr-z) <= 0; *ptd = ptl[z]; ptd += x; *ptd = ptr[-z]; ptd += !x; z += x;
                ptr++; // presumably incremented every step, as in the earlier variant, so ptr - z tracks the right element
        }

Something like this works as well, but it has near identical performance, and even if it was faster, it'd be rather atrocious.

I'd say gcc should bend the knee, and make ternary comparisons branchless. I'd love to see a reverse ternary comparison be added to the C standard, but it'll probably never happen. Might be worth the trouble for someone to write one in assembly just to measure the performance gain for quicksort.

I saw your results about quadsort beating fluxsort, but given that I expect to be using Robin Hood most of the time, I can't do much with it usually. From a high-level sample I can often see if RH will be useful, but not if it won't: partitioning splits the distribution of values into segments, and a globally curvy distribution like zipf often smooths out when looking at a segment.

I have been switching to piposort at the top-level for arrays of less than 192 elements, but I couldn't get it to work as a base case once partitioning starts. It looks like my problem was not supporting a separate source and destination pointer. I added this sort and now switching at 192 is an improvement (gives up adaptivity, since I can't do the no-copy 4 run thing; if I keep it I'll do it with copies or just an adaptive merge). Might close off some counting sort or RH cases, but usually they would have applied before partitioning if they'd apply after.

I've implemented small-array sorting networks and will use these for unstable small-array sorting. Keeping the merge-based ones for now to avoid making any decisions that don't carry over to the unstable case, until I get sort-by implemented. This selection is the same as ipn_unstable which also has its own way of extending these to any size that looks good.

It's been a while since I messed around with trying to optimize by removing the memcpy. Fluxsort might be of interest as it handles a separate source and destination quite smoothly if you call the partitioning routine directly. This might actually work well with wolfsort if I set it to target an average of 20 elements per bucket.

Memcpy might be rather bad when used on uncached memory. I may have to rethink how I handle my tail merges.

Yeah, fluxsort being able to work with a separate source is great (haven't actually carried that through to the outermost function but it'll be trivial). It's important for a language with immutable arrays because many sorts will have to keep the source intact.

I couldn't achieve any performance gains for wolfsort by omitting the memcpy. Regardless, I updated wolfsort and changed the division to a straight power of two.

It also appears reducing the number of buckets for large arrays is helpful for performance. I have wolfsort outperforming fluxsort up to 100M elements now.
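
The change is roughly of this shape (a simplified illustration, not the actual wolf_partition code; bins must hold 1 << buckets_log2 zeroed counters):

#include <stdint.h>
#include <stddef.h>

/* With a power-of-two bucket count, the bucket index is a shift of the key's
   top bits, so no div instruction is left in the hot loop. */
void count_buckets(const uint32_t *array, size_t nmemb, size_t *bins, unsigned int buckets_log2)
{
        unsigned int shift = 32 - buckets_log2;
        size_t i;

        for (i = 0 ; i < nmemb ; i++)
        {
                bins[array[i] >> shift]++; /* bucket index from the top bits */
        }
}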