Optimize decimal_length_minus_1
Closed this issue · 10 comments
As discussed in dtolnay/dragonbox#1, the algorithm for calculating the decimal length is rather inefficient, since the number of digits can be computed with minimal branching using a `shr`, an `add`, and a table lookup instruction, as described here.
The current implementation is:
```cpp
JKJ_FORCEINLINE static constexpr std::uint32_t decimal_length_minus_1(std::uint32_t const v) {
    assert(v < 1000000000);
    if (v >= 100000000) { return 8; }
    if (v >= 10000000) { return 7; }
    if (v >= 1000000) { return 6; }
    if (v >= 100000) { return 5; }
    if (v >= 10000) { return 4; }
    if (v >= 1000) { return 3; }
    if (v >= 100) { return 2; }
    if (v >= 10) { return 1; }
    return 0;
}
```
This optimizes quite poorly (for example, on x86):
```asm
decimal_length_minus_1(unsigned int):
        mov     eax, 8
        cmp     edi, 99999999
        ja      .L1
        mov     eax, 7
        cmp     edi, 9999999
        ja      .L1
        mov     eax, 6
        cmp     edi, 999999
        ja      .L1
        mov     eax, 5
        cmp     edi, 99999
        ja      .L1
        mov     eax, 4
        cmp     edi, 9999
        ja      .L1
        mov     eax, 3
        cmp     edi, 999
        ja      .L1
        mov     eax, 2
        cmp     edi, 99
        ja      .L1
        xor     eax, eax
        cmp     edi, 9
        seta    al
.L1:
        ret
```
Using a built-in ctlz function, we can make this much more efficient:
```cpp
#include <cstdint>
#ifdef _MSC_VER
#include <intrin.h>
#endif

JKJ_FORCEINLINE static std::uint32_t ctlz(std::uint32_t v) {
    // Optimize for GCC/Clang when possible.
#ifdef __has_builtin
#if __has_builtin(__builtin_clz)
    return __builtin_clz(v);
#endif
#endif
    // Optimize for MSVC.
#ifdef _MSC_VER
#if defined(_M_X64) || defined(_M_ARM64)
    unsigned long leading_zero = 0;
    _BitScanReverse(&leading_zero, v);  // index of the highest set bit
    return static_cast<std::uint32_t>(31 - leading_zero);
#endif
#endif
    // Fallback algorithm: binary search for the highest set bit.
    int count = 0;
    if (v & std::uint32_t(0xffff0000)) {
        v >>= 16;
        count += 16;
    }
    if (v & std::uint32_t(0xff00)) {
        v >>= 8;
        count += 8;
    }
    if (v & std::uint32_t(0xf0)) {
        v >>= 4;
        count += 4;
    }
    if (v & std::uint32_t(0xc)) {
        v >>= 2;
        count += 2;
    }
    if (v & std::uint32_t(0x2)) {
        v >>= 1;
        count += 1;
    }
    return 31 - count;
}

JKJ_FORCEINLINE static std::uint32_t fast_log2(std::uint32_t const v) {
    // The | 1 makes the result well-defined for v == 0.
    return 31 - ctlz(v | 1);
}

JKJ_FORCEINLINE static std::uint32_t decimal_length_minus_1(std::uint32_t const v) {
    static constexpr std::uint32_t table[] = {9,      99,      999,      9999,     99999,
                                              999999, 9999999, 99999999, 999999999};
    std::uint32_t y = (9 * fast_log2(v)) >> 5;
    y += v > table[y];
    return y;
}
```
Most of the bloat is just to ensure we have a portable way to count leading zeros. And this optimizes quite nicely when the built-ins are available:
```asm
decimal_length_minus_1(unsigned int):
        mov     eax, edi
        mov     edx, 31
        or      eax, 1
        bsr     eax, eax
        xor     eax, 31
        sub     edx, eax
        lea     eax, [rdx+rdx*8]
        shr     eax, 5
        mov     edx, eax
        cmp     DWORD PTR decimal_length_minus_1(unsigned int)::table[0+rdx*4], edi
        adc     eax, 0
        ret
```
For MSVC on x86_64, we have:
```asm
unsigned int decimal_length_minus_1(unsigned int) PROC ; decimal_length_minus_1, COMDAT
        xor     eax, eax
        mov     edx, ecx
        or      edx, 1
        bsr     r8d, edx
        lea     r9d, DWORD PTR [r8+r8*8]
        shr     r9d, 5
        lea     r8, OFFSET FLAT:unsigned int const * const `unsigned int decimal_length_minus_1(unsigned int)'::`2'::table
        cmp     ecx, DWORD PTR [r8+r9*4]
        seta    al
        add     eax, r9d
        ret     0
unsigned int decimal_length_minus_1(unsigned int) ENDP
```
And for MSVC on ARM64, we have:
```asm
|unsigned int decimal_length_minus_1(unsigned int)| PROC ; decimal_length_minus_1
        orr     w8,w0,#1
        clz     w9,w8
        mov     w10,#0x1F
        sub     w10,w10,w9
        add     w8,w10,w10,lsl #3
        adrp    x9,|unsigned int const * const `unsigned int decimal_length_minus_1(unsigned int)'::`2'::table|
        add     x9,x9,|unsigned int const * const `unsigned int decimal_length_minus_1(unsigned int)'::`2'::table|
        lsr     w11,w8,#5
        ldr     w8,[x9,w11 uxtw #2]
        cmp     w0,w8
        cset    w10,hi
        add     w0,w10,w11
        ret
        ENDP ; |unsigned int decimal_length_minus_1(unsigned int)|, decimal_length_minus_1
```
Since the fallback ctlz algorithm isn't very good, we could do a simple optimization for MSVC and GCC/Clang, and for any other compiler keep the old algorithm.
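To make that concrete, here is a hypothetical sketch of the dispatch (`HAS_FAST_CTLZ` is my name, not from the codebase); it reuses `fast_log2` from the snippet above on the fast path and keeps the original comparison chain elsewhere:

```cpp
#ifdef __has_builtin
#if __has_builtin(__builtin_clz)
#define HAS_FAST_CTLZ 1
#endif
#endif
#if !defined(HAS_FAST_CTLZ) && defined(_MSC_VER) && \
    (defined(_M_X64) || defined(_M_ARM64))
#define HAS_FAST_CTLZ 1
#endif

JKJ_FORCEINLINE static std::uint32_t decimal_length_minus_1(std::uint32_t const v) {
#ifdef HAS_FAST_CTLZ
    // Table-based path for compilers with a cheap clz.
    static constexpr std::uint32_t table[] = {9,      99,      999,      9999,     99999,
                                              999999, 9999999, 99999999, 999999999};
    std::uint32_t y = (9 * fast_log2(v)) >> 5;
    y += v > table[y];
    return y;
#else
    // Original comparison chain for everything else.
    if (v >= 100000000) { return 8; }
    if (v >= 10000000) { return 7; }
    if (v >= 1000000) { return 6; }
    if (v >= 100000) { return 5; }
    if (v >= 10000) { return 4; }
    if (v >= 1000) { return 3; }
    if (v >= 100) { return 2; }
    if (v >= 10) { return 1; }
    return 0;
#endif
}
```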
Good idea, but may I propose another optimization?
```cpp
JKJ_FORCEINLINE static std::uint32_t fast_log2(std::uint32_t const v) {
    return 31 - ctlz(v | 1);
}
```
Change `-` to `^`, and GCC will emit three fewer instructions. (This works because `ctlz(v | 1)` is always in [0, 31], where `31 - x` and `31 ^ x` agree.)
...though MSVC prefers `-`, so we may want to ifdef it. (Clang optimizes both equally.)
e: turns out MSVC prefers `-` only because `-` is used in that ctlz function. Easy fix.
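A minimal sketch of the suggested tweak (`fast_log2_xor` is my name for it, not from the thread):

```cpp
// Valid because ctlz(v | 1) is always in [0, 31], where 31 - x == 31 ^ x.
// On x86, __builtin_clz compiles to bsr xor 31, so the outer xor cancels
// and the bsr result is used directly.
JKJ_FORCEINLINE static std::uint32_t fast_log2_xor(std::uint32_t const v) {
    return 31 ^ ctlz(v | 1);
}
```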
Ooh, that could work nicely too (sorry, my initial reply assumed you meant `ctlz(v ^ 1)`, which would not work and would fail spectacularly when `v == 0`).
I'm actually aware of this trick. The reason why I'm hesitant to apply it is that it actually results in slower (even if shorter) code in terms of uniform average performance, b/c most numbers are long (i.e., have many digits). Of course, in practice short numbers occur more often than long numbers, so uniform average performance is not the best metric, but it is not a negligible metric either.
However, I think at least the code size benefit is obvious, and I guess it does have a meaningful impact on inlinability.
Currently I cannot afford enough time to carefully analyze the benefit/cost of this. Maybe later someday.
> I'm actually aware of this trick. The reason why I'm hesitant to apply it is that it actually results in slower (even if shorter) code in terms of uniform average performance, b/c most numbers are long (i.e., have many digits). Of course, in practice short numbers occur more often than long numbers, so uniform average performance is not the best metric, but it is not a negligible metric either.
Just from the implementation standpoint, this is definitely true for how binary64 calls the algorithm: since the default policy doesn't trim trailing zeros, there's a high likelihood, even with common floats, of having a large number of trailing zeros. But this isn't necessarily true for binary32 representations, which trim trailing zeros before this is called, or for binary64 with trailing-zero trimming enabled, for common floats.
Also, this effectively amortizes the time of calculating the number of digits, while providing good performance for common cases, which I believe should be a desirable property.
> Also, I think the resulting performance boost/regression may not be in the robustly measurable range.
This probably isn't a bottleneck, so I'd have to agree.
> Currently I cannot afford enough time to carefully analyze the benefit/cost of this. Maybe later someday.
I look forward to it. This probably has only a minimal impact on overall performance, given the larger bottlenecks elsewhere, so I understand this is very low priority.
Another candidate: https://godbolt.org/z/czrK918KP.
(Note that the input number is never 0 in our case so we don't need to take care of that special case.)
I think this implementation will be slightly better than what @Alexhuszagh proposed, as it uses one fewer table entry. It uses one or two more instructions, but as `adc` is usually quite slow, I expect this one to actually perform better than the one with `adc`.
But anyway, I'm leaning towards the side of leaving it as it is, since the suggested changes don't seem to provide better performance in general. But I will leave this issue open until someone provides a more comprehensive analysis.
EDIT: Sorry, the number of table entries must be the same. Thus the implementation above should be modified a little bit.
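Since the candidate itself is only behind the godbolt link, here is my reconstruction of a shape matching its description further down in the thread ("a different table, such that the conditional can be replaced with an addition and a few bitshifts"); treat it as a sketch, not the actual code. Storing 2^32 - 10^(y+1) lets the carry out of a 64-bit add replace the compare-plus-`adc`:

```cpp
#include <cstdint>

// Reconstruction sketch only; the real candidate is in the godbolt link
// above. table[y] = 2^32 - 10^(y+1), so bit 32 of the 64-bit sum
// v + table[y] is set exactly when v >= 10^(y+1).
JKJ_FORCEINLINE static std::uint32_t decimal_length_minus_1(std::uint32_t const v) {
    static constexpr std::uint32_t table[] = {
        4294967286u, 4294967196u, 4294966296u, 4294957296u, 4294867296u,
        4293967296u, 4284967296u, 4194967296u, 3294967296u};
    std::uint32_t const y = (9 * fast_log2(v)) >> 5;  // fast_log2 as proposed above
    return y + std::uint32_t((std::uint64_t(v) + table[y]) >> 32);
}
```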
Looks good, I'll probably benchmark this with a few contrived examples to see if it makes any difference. If not, I'll close this.
not if i benchmark them first
I took five ilog10 designs, reimplemented them, and tossed them into Quick-Bench: https://quick-bench.com/q/ohRRzsfTxtVf7sAFLZ_D6q8fcWU https://quick-bench.com/q/pwAl0hVZ6wpxe5c6W4J8cAOrWck (Quick-Bench glitches up if there are too many tests, had to split it)
- ilog10a is the current one as quoted in OP; it simply compares to powers of 10 in descending order.
- ilog10b is a trivial variant of ilog10a, which compares to 100000 first, halving the worst-case number of comparisons (but adding one more comparison for large numbers).
- ilog10c is OP's proposal, using log2 to find which power of 10 to compare to (except I replaced `*9>>5` with `*9/32` to remove a set of parentheses; it optimizes back to a bitshift).
- ilog10d is ilog10c, but with another lookup table instead of `*9/32`.
- ilog10e is jeon's counterproposal, which uses a different table, such that the conditional can be replaced with an addition and a few bitshifts. A quite clever design.
The functions are benchmarked six times each. First with randomly distributed odd numbers 1 through 0x1FFFFFFF = 536870911 (staying below ilog10a's assert(v < 1000000000)), second with numbers 1 through 255, third with each number length picked sequentially (length 1, then 2, then 3, then 4...), then numbers of uniformly distributed random length, and the last two exclusively test the constants 5 and 55555555.
The answers are quite conclusive: jeon's proposal, E, has vastly superior performance characteristics (though D wins on code size ignoring tables, and A wins on readability for beginners, if speed isn't your top priority).
A and B do an excellent job on many input distributions, often finishing 2x faster than C and D, but they also fall apart completely if the branch predictor can't do its job. And, as mentioned, their code size is rather unimpressive. B wins over A for short inputs, but A wins for longer ones, as expected. (Oddly enough, B remains the slowest for random lengths.)
C and D have a slower best case (jeon is right, adc is slow), but their speed remains constant across the board, and they easily defeat A and B at their worst. D is uniformly slightly slower than C. (Their speed does vary a little; I suspect it's either just noise, or the branch predictor doing something to the adc.)
E is, like C and D, constant time, and it's about 3x faster than them. It wins over everything else on five of six input distributions (sometimes by a large margin, sometimes narrowly), only losing to A and B on the large constant, where both are already fast in absolute numbers.
e: part of this analysis was written for an earlier version of the benchmark, which didn't shuffle the inputs. Fixed.
e2: Added two more benchmarks, repeatedly testing the constants 5 and 55555555. Also added A vs B, and C vs D.
e3: Created a sixth implementation. It's inferior to E in every way, not gonna analyze it. https://quick-bench.com/q/qZ7CimPBeYMEA7_ZT5bzF5J-dqA
@Alcaro Thanks a lot.
> D is uniformly slightly slower than C. (Their speed does vary a little; I suspect it's either just noise, or the branch predictor doing something to the adc.)
This is expected. Multiplying by 9 can be done very fast by abusing the `lea` instruction, so `*9/32` is nothing but `lea` and then `shr` (exactly what appears in the GCC output above: `lea eax, [rdx+rdx*8]` followed by `shr eax, 5`). Definitely better than a table lookup.
It seems that GCC actually favors C over E. Don't know why. Looking at the assembly, I think Clang actually does a better job here, so it's likely GCC's fault.
So it sounds reasonable to actually adopt `ilog10e`. I'll update the code shortly.
> ilog10e is jeon's counterproposal, which uses a different table, such that the conditional can be replaced with an addition and a few bitshifts. A quite clever design.
This is not my idea. I recall I saw it when I was reading Dr. Lemire's blog. I'll find the one who suggested it and attribute the inventor correctly in a comment.
That'd be George Spelvin, and everyone who contributed to the versions his idea is based on (Josh Bleecher Snyder, Kendall Willets, and of course Daniel Lemire himself). https://lemire.me/blog/2021/06/03/computing-the-number-of-digits-of-an-integer-even-faster/#comment-586067
Willets' idea (the blog OP, henceforth known as G) is also worth testing. While it needs a bigger table (256 bytes for 32-bit inputs, 1 KB for 64-bit; E only needs 36 or 152 bytes, respectively), it needs fewer instructions, at least for 32-bit inputs on 64-bit hardware. On 32-bit hardware or with 64-bit inputs, it gains a few more instructions, but it's still worth investigating both variants. https://quick-bench.com/q/-Ar6_ubBXxxUeesHRZB1yQdgf0k (Quick-Bench doesn't offer 32-bit hardware, so only the latter is tested.)
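To make the idea concrete, here is a sketch of the Willets-style computation, reconstructed from the blog post rather than copied from the benchmark, with the table built programmatically so the constants are correct by construction. Each bucket [2^lg, 2^(lg+1)) straddles at most one power of 10, so one 64-bit constant per bucket can fold the base digit count and the threshold compare into a single add:

```cpp
#include <array>
#include <cstdint>

// table[lg] is chosen so that (x + table[lg]) >> 32 equals the digit
// count of x for every x with floor(log2 x) == lg.
static const std::array<std::uint64_t, 32> g_table =
    []() -> std::array<std::uint64_t, 32> {
    std::array<std::uint64_t, 32> t{};
    std::uint64_t pow10 = 10;  // 10^d, the next digit-count threshold
    std::uint64_t d = 1;       // digit count of 2^lg
    for (int lg = 0; lg < 32; ++lg) {
        std::uint64_t const lo = std::uint64_t(1) << lg;
        while (d <= 9 && pow10 <= lo) {
            pow10 *= 10;
            ++d;
        }
        // Threshold form maps x < 10^d to d and larger x to d + 1; once
        // 10^d exceeds 2^32, the bucket's digit count is constant.
        t[lg] = d <= 9 ? ((d + 1) << 32) - pow10 : d << 32;
    }
    return t;
}();

// Returns the digit count (1..10), not length-minus-1.
static std::uint32_t digit_count(std::uint32_t const x) {
    // GCC/Clang builtin for brevity; the thread's portable ctlz works too.
    std::uint32_t const lg = 31 - std::uint32_t(__builtin_clz(x | 1));
    return std::uint32_t((x + g_table[lg]) >> 32);
}
```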
Turns out G is the fastest; G64 is identical to E, but G32 defeats both. And on GCC, the two Gs are identical, both being a fair bit faster than E. I don't know how the extra adc is free; I vaguely suspect instruction scheduling shenanigans, possibly involving false dependencies. https://quick-bench.com/q/07u1DKbzK9MOD8HH7Nj8QGUK7K8
Considering how rarely it's called (once per formatted number), I'm personally leaning towards E, but it's your choice.
(C++20 added ilog2 to the standard library, as std::countl_zero / std::bit_width. C++26 should totally add an ilog10.)
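With those C++20 facilities, the whole portable `ctlz` shim collapses to a one-liner; a minimal sketch:

```cpp
#include <bit>
#include <cstdint>

// C++20: std::bit_width(v) == floor(log2(v)) + 1 for v > 0; the | 1
// keeps v == 0 well-defined, like the thread's fast_log2.
static constexpr std::uint32_t fast_log2(std::uint32_t const v) {
    return std::uint32_t(std::bit_width(v | 1u)) - 1;
}
```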