Lokathor/safe_arch

_mm256_insertf128_si256(avx) vs _mm256_inserti128_si256 (avx2)

Lokathor opened this issue · 8 comments

It seems like the docs and even signature for _mm256_insertf128_si256 and _mm256_inserti128_si256 are essentially the same, however based on the names and the Felix Cloutier Notes it seems like _mm256_insertf128_si256 is mis-typed and should be operating on floating data, not integer data.

This would be a fairly simple fix, we can just throw in an extra cast or two if needed. The question is if this conclusion is correct and if we should make this adjustment for the user. Normally I'd be against safe_arch doing a "fix" like this but in this case it's the intel intrinsics who are wrapping the assembly wrong, so it feels fair to give proper direct access to the assembly.

Pings to:

Evrey commented

Intrinsics Guide is down, but here's stuff right out of Intel's manual:

VINSERTF128 __m256 _mm256_insertf128_ps (__m256 a, __m128 b, int offset);
VINSERTF128 __m256d _mm256_insertf128_pd (__m256d a, __m128d b, int offset);
VINSERTF128 __m256i _mm256_insertf128_si256 (__m256i a, __m128i b, int offset);

5-286 Vol. 2C (p. 2178 of the combined manual)

Aaand from that same manual…

VINSERTI128 __m256i _mm256_insertf128_si256 (__m256i a, __m128i b, int offset);

5-290 Vol. 2C (p. 2182 of the combined manual)

Lookin' like a typo, especially given that the manual has no mention of the _mm256_inserti128_si256 version. AMD manuals don't mention the intrinsics.

Anyways, as you can see for the vinsertf128 version, there are three intrinsics of different types that generate the exact same instruction. So you can just keep the types or drop that intrinsic or whatever else. The important ones are _ps and _pd.

The vinserti128 version is important to have and keep as is, though. x86 SIMD has two separate data paths for integer and float vectors. Crossing paths adds extra latency, thus slowing down the code. Whatever namings and types you pick, make sure to make this property transparent to users.

Oh, great, now we get to convince T-libs that there's a bug in core which needs a breaking change to properly fix.

Just great.

I don't know the policies of the libs team here, but I think it's reasonable to make such a change, given a crater run. Thanks for doing the detailed research to back it up.

Can you write a PR for the change you would propose, so we can put it through a crater run?

Yeah, I'll probably have time to do that tomorrow after the T-lang meeting

Evrey commented

To clarify, I meant that Intel's VINSERTI128 __m256i _mm256_insertf128_si256 (__m256i a, __m128i b, int offset); is a typo and should've been _mm256_inserti128_si256. The signatures may be intended. At least all C headers I know and all the articles out there and all the manuals agree on that, though nonsensical, typing of VINSERTF128 __m256i _mm256_insertf128_si256 (__m256i a, __m128i b, int offset);.

E: To clarify the clarification, _mm256_insertf128_si256 looks correct, but nonsensical because Intel. _mm256_inserti128_si256 makes sense. My guess is that the signature of the former is old convenience stuff, which when truly used on integer vectors will trigger costly domain transfers. And the latter is there to replace the former with a real cheap integer-domain version of that operation. As such, the docs should state to prefer ignoring the former and only use the latter.

Alright, I spoke with Evrey some more on Discord and we came to the agreement that:

  • There's no "actual" error in core, Intel just make stuff so dumb it looks like an error. Specifically:
    • _mm256_insertf128_ps and _mm256_insertf128_si256 both use the AVX VINSERTF128 instruction, which will work with integer data but it'll be slow because you have to move the integer data into the float pipeline, do the insert, and then move it back to the interger pipeline.
    • _mm256_insertf128_si256 uses AVX2 VINSERTI128 which is the "more correct" thing to do when possible because you don't take the performance penalty of moving your data between units.
  • However, since both _mm256_insertf128_si256 and _mm256_insertf128_si256 would normally have the same name under the safe_arch naming scheme, and since there's not much you can really be doing with m256i if you have AVX but not AVX2 anyway:
    • _mm256_insertf128_si256 will be a macro insert_m128i_to_m256i_slow_avx
    • _mm256_inserti128_si256 will be a macro insert_m128i_to_m256i
Evrey commented

I love the typos in your summary.