QUAD_CACHE: Question
Closed this issue · 6 comments
Just a general question about the code here. Why is QUAD_CACHE
only applied for gcc? Wouldn't you want to limit to fit in cache with clang as well? Does some sort of llvm optimization manage to work around cache issues?
#if !defined cmp && !defined __clang__ // cache limit workaround for gcc
if (left > QUAD_CACHE)
The main issue is that branchless executions are ineffective on uncached memory.
So the code is switching from a complex branchless merge in gcc, to the simpler merge that is the default in clang.
I see, so these merges in clang don't actually optimize to branchless:
#define head_branchless_merge(ptd, x, ptl, ptr, cmp) \
*ptd++ = cmp(ptl, ptr) <= 0 ? *ptl++ : *ptr++;
#define tail_branchless_merge(tpd, x, tpl, tpr, cmp) \
*tpd-- = cmp(tpl, tpr) > 0 ? *tpl-- : *tpr--;
How much does that end up hurting perf?
Clang uses a compiler hack that makes most ternary operations branchless, including the merges used by quadsort.
It gives around a 30% performance boost when sorting random data.
That's nice. I don't quite fully follow though. If clang still makes this branchless, why doesn't it need QUAD_CACHE
?
Shouldn't it still hit issues with:
The main issue is that branchless executions are ineffective on uncached memory.
Gcc uses these instructions for branchless merging:
x = cmp(ptl, ptr) <= 0; *ptd = *ptl; ptl += x; ptd[x] = *ptr; ptr += !x; ptd++;
Clang uses this:
*ptd++ = cmp(ptl, ptr) <= 0 ? *ptl++ : *ptr++;
So clang doesn't need QUAD_CACHE, because it's already using a simpler merge routine.
That makes sense. Thanks!