_mm256_insertf128_si256(avx) vs _mm256_inserti128_si256 (avx2)
Lokathor opened this issue · 8 comments
It seems like the docs and even signature for _mm256_insertf128_si256
and _mm256_inserti128_si256
are essentially the same, however based on the names and the Felix Cloutier Notes it seems like _mm256_insertf128_si256
is mis-typed and should be operating on floating data, not integer data.
This would be a fairly simple fix, we can just throw in an extra cast or two if needed. The question is if this conclusion is correct and if we should make this adjustment for the user. Normally I'd be against safe_arch
doing a "fix" like this but in this case it's the intel intrinsics who are wrapping the assembly wrong, so it feels fair to give proper direct access to the assembly.
Pings to:
Intrinsics Guide is down, but here's stuff right out of Intel's manual:
VINSERTF128 __m256 _mm256_insertf128_ps (__m256 a, __m128 b, int offset);
VINSERTF128 __m256d _mm256_insertf128_pd (__m256d a, __m128d b, int offset);
VINSERTF128 __m256i _mm256_insertf128_si256 (__m256i a, __m128i b, int offset);
5-286 Vol. 2C (p. 2178 of the combined manual)
Aaand from that same manual…
VINSERTI128 __m256i _mm256_insertf128_si256 (__m256i a, __m128i b, int offset);
5-290 Vol. 2C (p. 2182 of the combined manual)
Lookin' like a typo, especially given that the manual has no mention of the _mm256_inserti128_si256
version. AMD manuals don't mention the intrinsics.
Anyways, as you can see for the vinsertf128
version, there are three intrinsics of different types that generate the exact same instruction. So you can just keep the types or drop that intrinsic or whatever else. The important ones are _ps
and _pd
.
The vinserti128
version is important to have and keep as is, though. x86 SIMD has two separate data paths for integer and float vectors. Crossing paths adds extra latency, thus slowing down the code. Whatever namings and types you pick, make sure to make this property transparent to users.
Oh, great, now we get to convince T-libs that there's a bug in core
which needs a breaking change to properly fix.
Just great.
I don't know the policies of the libs team here, but I think it's reasonable to make such a change, given a crater run. Thanks for doing the detailed research to back it up.
Can you write a PR for the change you would propose, so we can put it through a crater run?
Yeah, I'll probably have time to do that tomorrow after the T-lang meeting
To clarify, I meant that Intel's VINSERTI128 __m256i _mm256_insertf128_si256 (__m256i a, __m128i b, int offset);
is a typo and should've been _mm256_inserti128_si256
. The signatures may be intended. At least all C headers I know and all the articles out there and all the manuals agree on that, though nonsensical, typing of VINSERTF128 __m256i _mm256_insertf128_si256 (__m256i a, __m128i b, int offset);
.
E: To clarify the clarification, _mm256_insertf128_si256
looks correct, but nonsensical because Intel. _mm256_inserti128_si256
makes sense. My guess is that the signature of the former is old convenience stuff, which when truly used on integer vectors will trigger costly domain transfers. And the latter is there to replace the former with a real cheap integer-domain version of that operation. As such, the docs should state to prefer ignoring the former and only use the latter.
Alright, I spoke with Evrey some more on Discord and we came to the agreement that:
- There's no "actual" error in
core
, Intel just make stuff so dumb it looks like an error. Specifically:_mm256_insertf128_ps
and_mm256_insertf128_si256
both use the AVXVINSERTF128
instruction, which will work with integer data but it'll be slow because you have to move the integer data into the float pipeline, do the insert, and then move it back to the interger pipeline._mm256_insertf128_si256
uses AVX2VINSERTI128
which is the "more correct" thing to do when possible because you don't take the performance penalty of moving your data between units.
- However, since both
_mm256_insertf128_si256
and_mm256_insertf128_si256
would normally have the same name under thesafe_arch
naming scheme, and since there's not much you can really be doing withm256i
if you have AVX but not AVX2 anyway:_mm256_insertf128_si256
will be a macroinsert_m128i_to_m256i_slow_avx
_mm256_inserti128_si256
will be a macroinsert_m128i_to_m256i
I love the typos in your summary.