scandum/quadsort

QUAD_CACHE: Question

Closed this issue · 6 comments

Just a general question about the code here. Why is QUAD_CACHE only applied for gcc? Wouldn't you want to limit to fit in cache with clang as well? Does some sort of llvm optimization manage to work around cache issues?

#if !defined cmp && !defined __clang__ // cache limit workaround for gcc
	if (left > QUAD_CACHE)

The main issue is that branchless executions are ineffective on uncached memory.

So the code is switching from a complex branchless merge in gcc, to the simpler merge that is the default in clang.

I see, so these merges in clang don't actually optimize to branchless:

#define head_branchless_merge(ptd, x, ptl, ptr, cmp)  \
	*ptd++ = cmp(ptl, ptr) <= 0 ? *ptl++ : *ptr++;
	
#define tail_branchless_merge(tpd, x, tpl, tpr, cmp)  \
	*tpd-- = cmp(tpl, tpr) > 0 ? *tpl-- : *tpr--;

How much does that end up hurting perf?

Clang uses a compiler hack that makes most ternary operations branchless, including the merges used by quadsort.

It gives around a 30% performance boost when sorting random data.

That's nice. I don't quite fully follow though. If clang still makes this branchless, why doesn't it need QUAD_CACHE?

Shouldn't it still hit issues with:

The main issue is that branchless executions are ineffective on uncached memory.

Gcc uses these instructions for branchless merging:

 x = cmp(ptl, ptr) <= 0;  *ptd = *ptl;  ptl += x;  ptd[x] = *ptr;   ptr += !x;  ptd++;

Clang uses this:

        *ptd++ = cmp(ptl, ptr) <= 0 ? *ptl++ : *ptr++;

So clang doesn't need QUAD_CACHE, because it's already using a simpler merge routine.

That makes sense. Thanks!