Simd leading zero count
stevenhoving opened this issue · 2 comments
Some time ago I came across an old book of mine (hackers delight). And for some reason I opened the part about zero counting. Showing the extremely smart tricks to a friend. A week ago I did read your article as it was posted on reddit.com/r/cpp. And this morning I saw your comment on the microsoft stl issue (microsoft/STL#870) related to your article. Long story short I did a 1+1 a figured that I could at the very least give you an working simd sample of one of the branchless hackers delight samples.
I did not benchmark this, and I am kinda throwing it over the fence but I guess you would appreciate it anyway.
int count_zeros_leading(unsigned int v)
{
unsigned a = v & ~(v >> 1);
auto b = (float)a + 0.5f;
auto c = *(int*)&b;
auto d = c >> 23;
return 158 - d;
}
__m128i count_zeros_simd_leading(__m128i v)
{
auto a = _mm_srli_epi32(v, 1);
auto b = _mm_sub_epi32(_mm_set1_epi32(0xFFFFFFFF), a);
auto c = _mm_and_si128(v, b);
auto d = _mm_add_ps(_mm_cvtepi32_ps(c), _mm_set1_ps(0.5));
auto e = _mm_castps_si128(d);
auto f = _mm_srli_epi32(e, 23);
return _mm_sub_epi32(_mm_set1_epi32(158), f);
}
Update:
This works, but its slow. I think best is to do _mm_cmpeq_epi8 + _mm_movemask_epi8 and count the zeros.
Like this:
inline std::uint64_t get_digit_count_from_numeric_mask(__m128i v)
{
const auto mask = _mm_movemask_epi8(_mm_cmpeq_epi8(v, _mm_setzero_si128()));
const auto count = __lzcnt16(mask);
return 16 - count;
}
Hey! Thanks for reading and suggesting. The movemask is super useful here, thanks!
I made a commit using movemask, and it's shrunk the runtime of the general benchmark from 2.5ns to 1.9ns! Nice. Now need a fix for that pesky left shift.
I had to use tzcnt
instead of lzcnt
because a string can contain other digits after the number we care about parsing, so lzcnt could count too many digits. Hence we need to look at trailing bits. E.g. consider parsing from a string like "1, 2, 3, 4, 5, 6". Also, the mask returned earlier from cmplt
already has the most significant bits set in each byte, so no need to _mm_cmpeq_epi8(v, _mm_setzero_si128())
Perhaps I should add a CONTRIBUTORS file and add you there.
I would appreciate that