Scalar BMI2 for decoding base64 is not being run
nkurz opened this issue · 3 comments
Hi Wojciech --
Thanks for publishing this. I was briefly confused that the scalar BMI2 speed was the same as the SSE BMI2 speed, and then noticed that the test was actually calling the same function twice.
--nate
diff --git a/base64/decode/sse/speed.cpp b/base64/decode/sse/speed.cpp
index b5defa1..95881db 100644
--- a/base64/decode/sse/speed.cpp
+++ b/base64/decode/sse/speed.cpp
@@ -33,7 +33,7 @@ public:
#if defined(HAVE_BMI2_INSTRUCTIONS)
if (cmd.empty() || cmd.has("scalar_bmi2")) {
- measure("scalar & BMI2", base64::sse::decode_bmi2, reference);
+ measure("scalar & BMI2", base64::scalar::decode_lookup1_bmi2, reference);
}
#endif
Also, it seems like the speedup for the scalar BMI2 version is due solely to doing one 32-bit write rather than three 8-bit writes.
- *out++ = b0 | (b1 << 6);
- *out++ = (b1 >> 2) | (b2 << 4);
- *out++ = (b2 >> 4) | (b3 << 2);
+ uint32_t dword = b0 | (b1 << 6) | (b2 << 12) | (b3 << 18);
+ *reinterpret_cast<uint32_t*>(out) = dword;
+ out += 3;
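For anyone who wants to check the two store strategies in isolation, here is a minimal sketch of that change. The function names (`store_bytes`, `store_dword`, `check_same`) are hypothetical and not from the repository. It assumes `b0`..`b3` are already-decoded 6-bit values and that the target is little-endian, as x86 is.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// b0..b3 are assumed to be already-decoded 6-bit values (0..63).

// Variant A: three separate 8-bit stores, as in the removed lines of the patch.
inline uint8_t* store_bytes(uint8_t* out, uint32_t b0, uint32_t b1,
                            uint32_t b2, uint32_t b3) {
    *out++ = static_cast<uint8_t>(b0 | (b1 << 6));
    *out++ = static_cast<uint8_t>((b1 >> 2) | (b2 << 4));
    *out++ = static_cast<uint8_t>((b2 >> 4) | (b3 << 2));
    return out;
}

// Variant B: pack 24 bits into one dword and issue a single 32-bit store.
// This writes 4 bytes but advances by 3, so the buffer needs one spare
// byte after the last decoded group. Little-endian byte order is assumed.
inline uint8_t* store_dword(uint8_t* out, uint32_t b0, uint32_t b1,
                            uint32_t b2, uint32_t b3) {
    const uint32_t dword = b0 | (b1 << 6) | (b2 << 12) | (b3 << 18);
    std::memcpy(out, &dword, sizeof(dword));  // stands in for the reinterpret_cast store
    return out + 3;
}

// Asserts that both variants produce the same three output bytes.
inline void check_same(uint32_t b0, uint32_t b1, uint32_t b2, uint32_t b3) {
    uint8_t a[4] = {0, 0, 0, 0};
    uint8_t b[4] = {0, 0, 0, 0};
    uint8_t* end_a = store_bytes(a, b0, b1, b2, b3);
    uint8_t* end_b = store_dword(b, b0, b1, b2, b3);
    assert(end_a - a == 3 && end_b - b == 3);
    assert(std::memcmp(a, b, 3) == 0);
}
```

The sketch uses `std::memcpy` instead of the `reinterpret_cast` store from the patch to sidestep alignment and aliasing concerns. Optimizing compilers typically turn a fixed-size 4-byte `memcpy` into a single store, but that is worth confirming in the generated assembly.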
After that patch, this is what I see on a Skylake i7-6700 CPU @ 3.40GHz:
nate@skylake:~/git/WojciechMula-toys/base64/decode/sse$ ./speed
input size: 67108864
improved scalar... 0.024
scalar... 0.041 (speed up: 0.59)
scalar & BMI2... 0.041 (speed up: 0.58)
SSE... 0.018 (speed up: 1.33)
SSE & BMI2... 0.016 (speed up: 1.50)
By the way, it would be helpful if you specified more about the CPU you are running the tests on. In particular, the generation (Nehalem, Sandy Bridge, Haswell, Skylake, etc.) is very useful to know.
Hi, thanks a lot for the report.
I've also noticed that the single write is the reason for the boost, but unfortunately I have no permanent access to a Core i7 (Haswell) machine to verify that.
Thanks, I've finally fixed that mistake in the speed program. :)