Implement all x86 vendor intrinsics

Question

Implement all x86 vendor intrinsics

alexcrichton opened this issue 7 years ago · 50 comments

Answer 1 · 2017-09-25T19:59:31.000Z

cc @BurntSushi @gnzlbg, I've opened this up and moved TODO.md out here, I figure it may be easier to collaborate here to ensure we can attach names everywhere!

Answer 2 · 2017-09-25T23:32:38.000Z

Could you edit the guide to suggest unsafe functions for the intrinsics? #21

Answer 3 · 2017-09-26T01:37:42.000Z

@mattico makes sense yeah! Although we may want to wait until #21 is closed out to avoid inconsistencies

Answer 4 · 2017-09-26T18:22:14.000Z

For those wishing to implement intrinsics above SSE2, make sure you're running your tests with RUSTFLAGS="-C target-cpu=native" cargo test on something which supports that instruction set extension. It looks lilke it's only running the SSE2 tests otherwise.

Answer 5 · 2017-09-26T18:41:16.000Z

You can use `RUSTFLAGS="-C target-feature=+avx2" to enable a particular extension. Note however that a CPU that does support the extension is needed for running the tests. To develop tests for a different architecture (e.g. develop for ARM from x86) you can use cross-compilation. To run the tests... travis is an option. I don't know if there is a better option though.

Answer 6 · 2017-09-26T18:42:54.000Z

It looks like travis only runs SSE2 and below with our current config. I wonder if their machines support AVX...

Answer 7 · 2017-09-26T18:44:12.000Z

@AdamNiederer oh that's actually a bug! I think I see what's going on though, I'll submit a fix.

Answer 8 · 2017-09-26T18:46:41.000Z

@alexcrichton https://github.com/rust-lang-nursery/stdsimd/blob/master/ci/run.sh probably needs to set RUSTFLAGS="-C target-cpu=native" to run most tests. @AdamNiederer makes a point though, what instruction sets does travis support? If it doesn't support AVX2, those will never be tested (I am pretty sure travis does not support AVX512, so we'll need a different solution for that).

Answer 9 · 2017-09-26T18:48:18.000Z

Added in #45. Let's see what Travis has to say about it.

EDIT: The build is failing, but those same 20 tests were failing for me on my Ivy Bridge box last night. I think LLVM might be spitting out wider version of 128 or 64-wide instructions on CPUs which support them. It also looks like travis supports AVX2 🎉

Answer 10 · 2017-09-26T19:32:59.000Z

@gnzlbg oh I'm going to add cfg_feature_enabled! to all tests and enable them all unconditionally all the time, that way whatever your cpu supports we'll be testing everything (without any required interaction)

@AdamNiederer thanks! I'll look into the failures and see if I can fix them.

Answer 11 · 2017-09-28T12:28:45.000Z

Interested in helping out with this. Figured I'd start super small with cvtps2dq #65

Answer 12 · 2017-09-29T15:41:58.000Z

Hello, I've given a try at __mm256_div_ps and its double counterpart, see #73.

Answer 13 · 2017-09-30T14:02:46.000Z

Post #81 SSE 4.2 should be covered.

Answer 14 · 2017-09-30T14:08:42.000Z

@dlrobertson Awesome! I've updated the checklist.

Answer 15 · 2017-10-05T11:54:41.000Z

I've got an implementation for _mm256_{hadd,hsub}_{ps,pd} in #95.

Answer 16 · 2017-10-06T12:54:12.000Z

What is the plan with FMA, is there a reason behind omitting it in the above list?

Answer 17 · 2017-10-06T15:15:07.000Z

Here are some intrinsics that are in the TODO, but are already implemented.

sse

_mm_getcsr _mm_setcsr _MM_GET_EXCEPTION_STATE _MM_SET_EXCEPTION_STATE _MM_GET_EXCEPTION_MASK _MM_SET_EXCEPTION_MASK _MM_GET_ROUNDING_MODE
_MM_SET_ROUNDING_MODE _MM_GET_FLUSH_ZERO_MODE
_MM_SET_FLUSH_ZERO_MODE _mm_prefetch _mm_sfence

sse2

_mm_cvtpd_epi32 _mm_cvtsd_si32 _mm_cvtsd_ss _mm_cvtss_sd _mm_cvttpd_epi32 _mm_cvttsd_si32 _mm_cvttps_epi32 _mm_load_pd (no tests) _mm_store_pd (no tests) _mm_load1_pd

sse3

_mm_addsub_ps _mm_addsub_ps _mm_hadd_pd _mm_hadd_ps _mm_hsub_pd _mm_hsub_ps _mm_lddqu_si128 _mm_movedup_pd _mm_loaddup_pd _mm_movehdup_ps _mm_moveldup_ps

ssse3

_mm_alignr_epi8

avx

_mm256_and_pd _mm256_and_ps _mm256_andnot_pd _mm256_andnot_ps _mm256_blend_pd _mm256_blend_ps _mm256_blendv_pd _mm256_blendv_ps _mm256_div_pd _mm256_div_ps _mm256_dp_ps _mm256_hadd_pd _mm256_hadd_ps _mm256_hsub_pd _mm256_hsub_ps _mm256_or_pd _mm256_or_ps _mm256_shuffle_pd _mm256_shuffle_ps _mm256_xor_pd _mm256_xor_ps _mm256_cvtepi32_pd _mm256_cvtepi32_ps _mm256_cvtpd_ps _mm256_cvtps_epi32 _mm256_cvtps_pd _mm256_cvttpd_epi32 _mm256_cvtpd_epi32 _mm256_cvttps_epi32 _mm256_extractf128_ps _mm256_extractf128_pd _mm256_extractf128_si256 _mm256_extract_epi8 _mm256_extract_epi16 _mm256_extract_epi32 _mm256_extract_epi64 _mm256_zeroall _mm256_zeroupper _mm256_permutevar_ps _mm_permutevar_ps _mm256_permute_ps _mm256_undefined_ps _mm256_undefined_pd _mm256_undefined_si256

avx2

_mm256_alignr_epi8 _mm256_movemask_epi8

Answer 18 · 2017-10-06T17:45:37.000Z

@p32blo updated!

Answer 19 · 2017-10-08T10:12:11.000Z

_mm256_blend_ps and _mm256_shuffle_ps are not implemented.
When I try, I have to kill cargo/rustc: it seems that the macros expansion is too complex (8 levels).

Answer 20 · 2017-10-12T13:13:02.000Z

This post should add how to document the intrinsics.

Answer 21 · 2017-10-12T13:13:33.000Z

@rroohhh it should be part of AVX2 although we might want to implement it in its own module.

Answer 22 · 2017-10-12T18:00:14.000Z

@alexcrichton this issue's topic is quite long and hard to browse, could you please use something like the mechanism described in this comment, to allow collapsing individual sections?

Something like this

Some intrinsic

Code for the above:

<details><summary>Something like this</summary><p>
       << This line break is necessary!
- [ ] Some intrinsic
</p></details>

Answer 23 · 2017-10-15T20:21:32.000Z

@alexcrichton Could you please check off the following tasks in the SSE section?

everything from _mm_and_ps until _mm_ucomineq_ss
everything from _mm_set_ss until _mm_loadr_ps

For _mm_stream_ps please annotate it with a link to #114

Answer 24 · 2017-10-16T15:48:06.000Z

@nominolo done1

Answer 25 · 2017-11-07T08:37:09.000Z

Note that the _mm256_cvtps_ph AVX-1 instructions are missing from the list. These might require extra work since they operate on half-floats but Rust does not support them yet.

Answer 26 · 2017-11-07T09:41:06.000Z

@gnzlbg Support for half-floats is provided by the half-rs crate. In fact, half-rs already exposes these LLVM intrinsics.

Answer 27 · 2017-11-07T09:51:18.000Z

@GabrielMajeri Maybe I am misunderstanding the situation (so please correct me), but what I had in mind is that the vector types would need to be f16x8 (with a half-float upfront), so functions like extract and insert on those vector types would need to somehow deal with half-floats (I think it would be weird if extract on a f16x8 would return an f32).

Also, 1195 ARM NEON intrinsics operate on half-float vectors directly as well (e.g. pub unsafe fn vmaxv_f16(a: f16x4) -> f16) so this is something that we might need to get right anyways to support those.

Answer 28 · 2017-11-07T09:57:26.000Z

@gnzlbg I wasn't aware of the situation on ARM.

Intel's recommendation on x86 is to only use half-floats to reduce memory bandwidth or improve space usage.

Doing any sort of actual operation on them is after you load them into float or double registers, which is why besides the packing / unpacking features there is no support for extracting certain values or anything like that.

It seems that ARM indeed has support for operating on the values, so there might be some more work involved there.

Answer 29 · 2017-11-07T10:21:26.000Z

Intel's recommendation on x86 is to only use half-floats to reduce memory bandwidth or improve space usage.

I think that information might be slightly outdated. AVX-512 still doesn't have any instructions to directly operate on single f16s, but AVX-512 4VNNIW (Vector Neural Network Instructions Word variable precision) adds some newer instructions for directly working on 16-bit float vectors.

Answer 30 · 2017-11-09T17:15:22.000Z

Is this missing SSE4a intrinsics?

Answer 31 · 2017-11-28T10:46:54.000Z

I have updated the parent post with:

recently implemented SSE intrinsics (SSE is almost finished)
list of MMX intrinsics (required to implement some of the missing SSE ones)
list of SSE4a intrinsics

Answer 32 · 2018-01-29T15:38:52.000Z

Alright we're actually quite close to finishing this off! I've upated the lists above with up-to-date instructions as well as an updated list of remaining intrinsics (all the checked off ones are removed for now).

Answer 33 · 2018-01-29T19:12:26.000Z

Many of those functions are part of intel's SVML, and don't map to a single instruction. Do we intend to link stdsimd against Intel's library for those? I also don't think we should include them in the SSE/AVX sections as they're listed above.

Answer 34 · 2018-01-29T20:50:26.000Z

@AdamNiederer oh interesting! I'm actually not sure what that is (SVML), could you expand a bit on what that is?

I was noticing though that some of the trigonometry-related functions weren't defined in either clang/gcc, which means we probably shouldn't be doing it just yet!

Answer 35 · 2018-01-29T21:30:30.000Z

@alexcrichton long story short:

The Intel® C++ Compiler provides short vector math library (SVML) intrinsics to compute vector math functions. ... The SVML intrinsics do not have any corresponding instructions. The prototypes for the SVML intrinsics are available in the immintrin.h file.

https://software.intel.com/en-us/node/524289

Answer 36 · 2018-01-29T21:32:04.000Z

The SVML is just a bunch of inlining-friendly assembly-level subroutines which use SSE/AVX instructions to compute higher-level mathematical primitives. I'm pretty sure it's "just another library", otherwise. It's heavily optimized for Intel CPUs, much like ICC. I'm also pretty sure it's not open-source or readily available.

Answer 37 · 2018-01-30T00:46:04.000Z

@alexcrichton , sse instructions are split into 3 folders: i586, i686 and x86_64. How should I know where to put an implementation for _mm_log2_pd, for example? It is not obvious for me.

Answer 38 · 2018-01-30T06:03:13.000Z

@crypto-universe @AdamNiederer ok cool, thanks for the info! Sounds like I should omit those intrinsics. I've updated the OP to omit the SVML intrinsics.

@crypto-universe oh the division between those modules is somewhat non-important now. The main one is that x86_64 is only compiled on 64-bit targets, but 32-bit targets compile both i586 and i686. If the intrinsic only works on x86_64 it should go there, otherwise either of the other modules is fine.

Answer 39 · 2018-02-11T16:29:51.000Z

Ok I think this is effectively "done enough" that we can close and follow up with more specific issues if need be. Thanks so much for everyone's help on this!

Answer 40 · 2020-09-17T21:44:15.000Z

Is this the right place to mention that core::arch is missing RISC-V support or should I open a tracking bug? (I'm specifically interested in adding support for the equivalent of rdtsc).

Answer 41 · 2020-09-17T23:48:53.000Z

We generally try to stick to vendor-specified intrinsics, e.g. SSE intrinsics and ARM NEON intrinsics. AFAIK RISC-V doesn't have any target-specific intrinsics defined in GCC or Clang.

Answer 42 · 2020-09-17T23:55:34.000Z

Ough. Thanks. I can see your reasoning, but that raises the bar by orders of magnitude and pushes the problem to all clients of core::arch :(

Answer 43 · 2020-09-17T23:56:28.000Z

You can always just use inline assembly if you really want a specific instruction...

Answer 44 · 2020-09-17T23:57:15.000Z

That's literally what "pushes the problem to all clients" means.

Answer 45 · 2020-09-18T00:03:36.000Z

Probably best to just open a new issue where it can get eyes and discussion. The tail end of a long-closed issue isn't a good way to bring your problem to light.

Answer 46 · 2020-09-18T00:07:26.000Z

@Amanieu It doesn't look like there are any RISC-V intrinsics in llvm/clang yet, but there is some recent work in that area: https://www.sifive.com/blog/risc-v-vector-extension-intrinsic-support

Answer 47 · 2020-09-18T00:09:27.000Z

Those are actually much trickier than it seems since they involve scalable vectors with a size not known at compile-time. This requires special support in the compiler. The same issue applies to the ARM SVE intrinsics.

Answer 48 · 2020-12-09T21:27:33.000Z

Out of interest and because it has recently become relevant: ["VMX"] would be helpful.

Answer 49 · 2021-05-14T11:50:27.000Z

Maybe it's worth to open separate issue for each target feature? For example, I wanted to use _mm_stream_load_si128 and was quite surprised that std::arch::x86_64 does not have it.

Is there a reason why streaming load intrinsics were omitted?

Answer 50 · 2021-05-14T12:11:10.000Z

Please open a new issue if there are any missing intrinsic.

Implement all x86 vendor intrinsics

Writing the implementation

References

TODO