Implement all x86 vendor intrinsics
alexcrichton opened this issue · 50 comments
This is intended to be a tracking issue for implementing all vendor intrinsics in this repository.
This issue is also intended to be a guide for documenting the process of adding new vendor intrinsics to this crate.
If you decide to implement a set of vendor intrinsics, please check the list below to make sure somebody else isn't already working on them. If it's not checked off or has a name next to it, feel free to comment that you'd like to implement it!
At a high level, each vendor intrinsic should correspond to a single exported Rust function with an appropriate `target_feature` attribute. Here's an example for `_mm_adds_epi16`:
/// Add packed 16-bit integers in `a` and `b` using saturation.
#[inline]
#[target_feature(enable = "sse2")]
#[cfg_attr(test, assert_instr(paddsw))]
pub unsafe fn _mm_adds_epi16(a: __m128i, b: __m128i) -> __m128i {
    unsafe { paddsw(a, b) }
}
Let's break this down:
- The `#[inline]` attribute is added because vendor intrinsic functions should generally always be inlined: the intent of a vendor intrinsic is to correspond to a single particular CPU instruction, and one that is compiled into an actual function call could be quite disastrous for performance.
- The `#[target_feature(enable = "sse2")]` attribute instructs the compiler to generate code with the `sse2` target feature enabled, regardless of the target platform. That is, even if you're compiling for a platform that doesn't support `sse2`, the compiler will still generate code for `_mm_adds_epi16` as if `sse2` support existed. Without this attribute, the compiler might not generate the intended CPU instruction.
- The `#[cfg_attr(test, assert_instr(paddsw))]` attribute indicates that when we're testing the crate we'll assert that the `paddsw` instruction is generated inside this function, ensuring that the SIMD intrinsic truly is an intrinsic for the instruction!
- The types of the vectors given to the intrinsic should match exactly the types provided in the vendor interface (with things like `int64_t` translated to `i64` in Rust).
- The implementation of the vendor intrinsic is generally very simple. Remember, the goal is to compile a call to `_mm_adds_epi16` down to a single particular CPU instruction. As such, the implementation typically defers to a compiler intrinsic (in this case, `paddsw`) when one is available. More on this below as well.
- The intrinsic itself is `unsafe` due to the usage of `#[target_feature]`.
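To make the semantics concrete, here is a minimal sketch of what `paddsw` computes, written as a scalar reference model and, on x86_64 only, compared against the `_mm_adds_epi16` intrinsic as exposed today by `std::arch`. The helper names `adds_epi16_model` and `adds_epi16_hw` are made up for illustration and are not part of this crate.

```rust
/// Scalar reference model (hypothetical helper): lane-wise saturating
/// addition of eight 16-bit integers, which is what `paddsw` does.
fn adds_epi16_model(a: [i16; 8], b: [i16; 8]) -> [i16; 8] {
    let mut out = [0i16; 8];
    for i in 0..8 {
        // Results clamp to i16::MIN / i16::MAX instead of wrapping.
        out[i] = a[i].saturating_add(b[i]);
    }
    out
}

/// The same operation through the real intrinsic (hypothetical wrapper).
#[cfg(target_arch = "x86_64")]
fn adds_epi16_hw(a: [i16; 8], b: [i16; 8]) -> [i16; 8] {
    use std::arch::x86_64::{__m128i, _mm_adds_epi16, _mm_loadu_si128, _mm_storeu_si128};
    let mut out = [0i16; 8];
    // SAFETY: SSE2 is part of the x86_64 baseline, and the unaligned
    // load/store intrinsics only need valid pointers to 16 bytes.
    unsafe {
        let va = _mm_loadu_si128(a.as_ptr() as *const __m128i);
        let vb = _mm_loadu_si128(b.as_ptr() as *const __m128i);
        let vr = _mm_adds_epi16(va, vb);
        _mm_storeu_si128(out.as_mut_ptr() as *mut __m128i, vr);
    }
    out
}
```

Comparing the two functions on inputs that overflow (for example `i16::MAX + 1`) is a quick way to check that the saturating behaviour is what you expect.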
Once a function has been added, you should also add at least one test for basic functionality. Here's an example for `_mm_adds_epi16`:
#[simd_test = "sse2"]
unsafe fn test_mm_adds_epi16() {
    let a = _mm_set_epi16(0, 1, 2, 3, 4, 5, 6, 7);
    let b = _mm_set_epi16(8, 9, 10, 11, 12, 13, 14, 15);
    let r = _mm_adds_epi16(a, b);
    let e = _mm_set_epi16(8, 10, 12, 14, 16, 18, 20, 22);
    assert_eq_m128i(r, e);
}
Note that `#[simd_test]` is the same as `#[test]`; it's just a custom macro that enables the target feature in the test and generates a wrapper ensuring the feature is available on the local CPU as well.
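As a rough illustration of the wrapper idea, the sketch below pairs a `#[target_feature]` test body with a plain function that checks for the feature at run time before calling it. This is hand-written for illustration and is an assumption about the general shape, not the actual expansion of `#[simd_test]`; the names `test_body` and `run_simd_test` are hypothetical.

```rust
/// Test body compiled with SSE2 enabled (hypothetical example).
/// Returns true if the saturating add produced the expected lanes.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "sse2")]
unsafe fn test_body() -> bool {
    use std::arch::x86_64::{_mm_adds_epi16, _mm_cmpeq_epi16, _mm_movemask_epi8, _mm_set_epi16};
    let a = _mm_set_epi16(0, 1, 2, 3, 4, 5, 6, 7);
    let b = _mm_set_epi16(8, 9, 10, 11, 12, 13, 14, 15);
    let r = _mm_adds_epi16(a, b);
    let e = _mm_set_epi16(8, 10, 12, 14, 16, 18, 20, 22);
    // All 16 byte-mask bits are set only if every lane compared equal.
    _mm_movemask_epi8(_mm_cmpeq_epi16(r, e)) == 0xffff
}

/// Wrapper: runs the body only when the CPU reports the feature.
/// Returns None when the test had to be skipped.
fn run_simd_test() -> Option<bool> {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("sse2") {
            // SAFETY: the required feature was just detected at run time.
            return Some(unsafe { test_body() });
        }
    }
    None
}
```

Returning `None` for a skipped test is a choice made for this sketch; a real harness might instead print a message or fail loudly so that skipped tests do not go unnoticed.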
Finally, once that's done, send a PR!
Writing the implementation
An implementation of an intrinsic (so far) generally has one of three shapes:
- The vendor intrinsic does not have any corresponding compiler intrinsic, so you must write the implementation in such a way that the compiler will recognize it and produce the desired codegen. For example, the `_mm_add_epi16` intrinsic (note the missing `s` in `add`) is implemented via `simd_add(a, b)`, which compiles down to LLVM's cross-platform SIMD vector API.
- The vendor intrinsic does have a corresponding compiler intrinsic, so you must write an `extern` block to bring that intrinsic into scope and then call it. The example above (`_mm_adds_epi16`) uses this approach.
- The vendor intrinsic has a parameter that must be a constant value when given to the CPU instruction, where that constant is often a parameter that impacts the operation of the intrinsic. This means the implementation of the vendor intrinsic must guarantee that a particular parameter be a constant. This is tricky because Rust doesn't (yet) have a stable way of doing this, so we have to do it ourselves. How you do it can vary, but one particularly gnarly example is `_mm_cmpestri` (make sure to look at the `constify_imm8!` macro).
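The third shape can be sketched in miniature. The idea is to turn a run-time value into a compile-time constant by matching on every possible value and calling the constant-taking code with a literal in each arm. The example below is a simplified 2-bit analogue written for illustration, assuming a compiler with const generics; `constify_imm2!`, `shift_left_const` and `shift_left_runtime` are hypothetical names, and the shift stands in for an instruction that needs an immediate operand.

```rust
/// Stand-in for an operation whose parameter must be a compile-time
/// constant (hypothetical; a real intrinsic would need an immediate here).
fn shift_left_const<const IMM: u32>(x: u32) -> u32 {
    x << IMM
}

/// Simplified 2-bit analogue of the constify idea: expand `$expand!`
/// once per possible value so each call site sees a literal.
macro_rules! constify_imm2 {
    ($imm:expr, $expand:ident) => {
        match ($imm) & 0b11 {
            0 => $expand!(0),
            1 => $expand!(1),
            2 => $expand!(2),
            _ => $expand!(3),
        }
    };
}

/// Public-facing function taking the "immediate" as a run-time value.
fn shift_left_runtime(x: u32, imm: u32) -> u32 {
    // Local macro that performs the call with a constant argument.
    macro_rules! call {
        ($c:expr) => {
            shift_left_const::<{ $c }>(x)
        };
    }
    constify_imm2!(imm, call)
}
```

With 8 bits the match grows to 256 arms, which is why the real macro is large and why nesting several such expansions can become expensive to compile.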
References
All Intel intrinsics can be found here: https://software.intel.com/sites/landingpage/IntrinsicsGuide/#expand=5236
The compiler intrinsics available to us through LLVM can be found here: https://gist.github.com/anonymous/a25d3e3b4c14ee68d63bd1dcb0e1223c
The Intel vendor intrinsic API can be found here: https://gist.github.com/anonymous/25d752fda8521d29699a826b980218fc
The Clang header files for vendor intrinsics can also be incredibly useful. When in doubt, Do What Clang Does:
https://github.com/llvm-mirror/clang/tree/master/lib/Headers
TODO
["MMX"]
- _mm_srli_pi16
- _mm_srl_pi16
- _mm_mullo_pi16
- _mm_slli_si64
- _mm_mulhi_pi16
- _mm_srai_pi16
- _mm_srli_si64
- _mm_and_si64
- _mm_cvtsi32_si64
- _mm_cvtm64_si64
- _mm_andnot_si64
- _mm_packs_pu16
- _mm_madd_pi16
- _mm_cvtsi64_m64
- _mm_cmpeq_pi16
- _mm_sra_pi32
- _mm_cvtsi64_si32
- _mm_cmpeq_pi8
- _mm_srai_pi32
- _mm_sll_pi16
- _mm_srli_pi32
- _mm_slli_pi16
- _mm_srl_si64
- _mm_empty
- _mm_srl_pi32
- _mm_slli_pi32
- _mm_or_si64
- _mm_sll_si64
- _mm_sra_pi16
- _mm_sll_pi32
- _mm_xor_si64
- _mm_cmpeq_pi32
cc @BurntSushi @gnzlbg, I've opened this up and moved `TODO.md` out here. I figure it may be easier to collaborate here to ensure we can attach names everywhere!
For those wishing to implement intrinsics above SSE2, make sure you're running your tests with `RUSTFLAGS="-C target-cpu=native" cargo test` on something which supports that instruction set extension. It looks like it's only running the SSE2 tests otherwise.
You can use `RUSTFLAGS="-C target-feature=+avx2"` to enable a particular extension. Note, however, that a CPU that supports the extension is needed to run the tests. To develop tests for a different architecture (e.g. develop for ARM from x86) you can use cross-compilation. To run the tests... Travis is an option. I don't know if there is a better option though.
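Before relying on a machine to run tests for a given extension, it can help to check what the CPU actually reports. The following is a small sketch using the standard library's run-time detection macro; the function name `detected_features` and the particular set of features listed are arbitrary choices for illustration.

```rust
/// Returns (feature name, detected?) pairs for a few x86 extensions.
/// On non-x86 targets the list is empty.
fn detected_features() -> Vec<(&'static str, bool)> {
    #[allow(unused_mut)]
    let mut out = Vec::new();
    #[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
    {
        out.push(("sse2", is_x86_feature_detected!("sse2")));
        out.push(("sse4.2", is_x86_feature_detected!("sse4.2")));
        out.push(("avx", is_x86_feature_detected!("avx")));
        out.push(("avx2", is_x86_feature_detected!("avx2")));
    }
    out
}
```

Printing the result, for example with `println!("{:?}", detected_features())`, on a CI worker shows which extensions its tests can actually exercise.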
It looks like travis only runs SSE2 and below with our current config. I wonder if their machines support AVX...
@AdamNiederer oh that's actually a bug! I think I see what's going on though, I'll submit a fix.
@alexcrichton https://github.com/rust-lang-nursery/stdsimd/blob/master/ci/run.sh probably needs to set `RUSTFLAGS="-C target-cpu=native"` to run most tests. @AdamNiederer makes a point though: what instruction sets does Travis support? If it doesn't support AVX2, those will never be tested (I am pretty sure Travis does not support AVX512, so we'll need a different solution for that).
Added in #45. Let's see what Travis has to say about it.
EDIT: The build is failing, but those same 20 tests were failing for me on my Ivy Bridge box last night. I think LLVM might be spitting out wider versions of 128- or 64-wide instructions on CPUs which support them. It also looks like Travis supports AVX2.
@gnzlbg oh I'm going to add `cfg_feature_enabled!` to all tests and enable them all unconditionally all the time. That way, whatever your CPU supports, we'll be testing everything (without any required interaction).
@AdamNiederer thanks! I'll look into the failures and see if I can fix them.
Interested in helping out with this. Figured I'd start super small with `cvtps2dq`: #65
Hello, I've had a try at `_mm256_div_ps` and its double counterpart, see #73.
Post #81 SSE 4.2 should be covered.
@dlrobertson Awesome! I've updated the checklist.
I've got an implementation for `_mm256_{hadd,hsub}_{ps,pd}` in #95.
What is the plan with FMA, is there a reason behind omitting it in the above list?
Here are some intrinsics that are in the TODO, but are already implemented.
sse
_mm_getcsr
_mm_setcsr
_MM_GET_EXCEPTION_STATE
_MM_SET_EXCEPTION_STATE
_MM_GET_EXCEPTION_MASK
_MM_SET_EXCEPTION_MASK
_MM_GET_ROUNDING_MODE
_MM_SET_ROUNDING_MODE
_MM_GET_FLUSH_ZERO_MODE
_MM_SET_FLUSH_ZERO_MODE
_mm_prefetch
_mm_sfence
sse2
_mm_cvtpd_epi32
_mm_cvtsd_si32
_mm_cvtsd_ss
_mm_cvtss_sd
_mm_cvttpd_epi32
_mm_cvttsd_si32
_mm_cvttps_epi32
_mm_load_pd
(no tests) _mm_store_pd
(no tests) _mm_load1_pd
sse3
_mm_addsub_ps
_mm_addsub_ps
_mm_hadd_pd
_mm_hadd_ps
_mm_hsub_pd
_mm_hsub_ps
_mm_lddqu_si128
_mm_movedup_pd
_mm_loaddup_pd
_mm_movehdup_ps
_mm_moveldup_ps
ssse3
_mm_alignr_epi8
avx
_mm256_and_pd
_mm256_and_ps
_mm256_andnot_pd
_mm256_andnot_ps
_mm256_blend_pd
_mm256_blend_ps
_mm256_blendv_pd
_mm256_blendv_ps
_mm256_div_pd
_mm256_div_ps
_mm256_dp_ps
_mm256_hadd_pd
_mm256_hadd_ps
_mm256_hsub_pd
_mm256_hsub_ps
_mm256_or_pd
_mm256_or_ps
_mm256_shuffle_pd
_mm256_shuffle_ps
_mm256_xor_pd
_mm256_xor_ps
_mm256_cvtepi32_pd
_mm256_cvtepi32_ps
_mm256_cvtpd_ps
_mm256_cvtps_epi32
_mm256_cvtps_pd
_mm256_cvttpd_epi32
_mm256_cvtpd_epi32
_mm256_cvttps_epi32
_mm256_extractf128_ps
_mm256_extractf128_pd
_mm256_extractf128_si256
_mm256_extract_epi8
_mm256_extract_epi16
_mm256_extract_epi32
_mm256_extract_epi64
_mm256_zeroall
_mm256_zeroupper
_mm256_permutevar_ps
_mm_permutevar_ps
_mm256_permute_ps
_mm256_undefined_ps
_mm256_undefined_pd
_mm256_undefined_si256
avx2
_mm256_alignr_epi8
_mm256_movemask_epi8
@p32blo updated!
`_mm256_blend_ps` and `_mm256_shuffle_ps` are not implemented.
When I try, I have to kill cargo/rustc: it seems that the macro expansion is too complex (8 levels).
This post should add how to document the intrinsics.
@rroohhh it should be part of AVX2 although we might want to implement it in its own module.
@alexcrichton this issue's topic is quite long and hard to browse, could you please use something like the mechanism described in this comment, to allow collapsing individual sections?
Something like this
- Some intrinsic
Code for the above:
<details><summary>Something like this</summary><p>
<< This line break is necessary!
- [ ] Some intrinsic
</p></details>
@alexcrichton Could you please check off the following tasks in the SSE section?
- everything from `_mm_and_ps` until `_mm_ucomineq_ss`
- everything from `_mm_set_ss` until `_mm_loadr_ps`

For `_mm_stream_ps` please annotate it with a link to #114
@nominolo done!
Note that the `_mm256_cvtps_ph` AVX-1 instructions are missing from the list. These might require extra work since they operate on half-floats, which Rust does not support yet.
@gnzlbg Support for half-floats is provided by the `half-rs` crate. In fact, `half-rs` already exposes these LLVM intrinsics.
@GabrielMajeri Maybe I am misunderstanding the situation (so please correct me), but what I had in mind is that the vector types would need to be `f16x8` (with a half-float upfront), so functions like `extract` and `insert` on those vector types would need to somehow deal with half-floats (I think it would be weird if `extract` on an `f16x8` would return an `f32`).
Also, 1195 ARM NEON intrinsics operate on half-float vectors directly as well (e.g. `pub unsafe fn vmaxv_f16(a: f16x4) -> f16`), so this is something that we might need to get right anyways to support those.
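To illustrate the design point, here is a minimal sketch of a vector type whose `extract` returns a half-float element rather than an `f32`, with the widening conversion left as an explicit step. The types `F16` and `F16x8` are hypothetical stand-ins defined only for this example: they are not the types from this crate or from `half-rs`, and the conversion is done in software from the IEEE 754 binary16 bit layout.

```rust
/// Hypothetical half-float wrapper storing the raw binary16 bits.
#[derive(Clone, Copy, Debug, PartialEq)]
struct F16(u16);

impl F16 {
    /// Software widening to f32 (1 sign, 5 exponent, 10 fraction bits).
    fn to_f32(self) -> f32 {
        let bits = self.0 as u32;
        let sign = (bits & 0x8000) << 16;
        let exp = (bits >> 10) & 0x1f;
        let frac = bits & 0x03ff;
        match exp {
            // Zero or subnormal: value is frac * 2^-24.
            0 => {
                let magnitude = frac as f32 * (1.0 / 16_777_216.0);
                if sign != 0 { -magnitude } else { magnitude }
            }
            // Infinity or NaN: keep the payload, use f32's all-ones exponent.
            0x1f => f32::from_bits(sign | 0x7f80_0000 | (frac << 13)),
            // Normal number: rebias the exponent from 15 to 127.
            _ => f32::from_bits(sign | ((exp + 112) << 23) | (frac << 13)),
        }
    }
}

/// Hypothetical 8-lane half-float vector.
#[derive(Clone, Copy, Debug, PartialEq)]
struct F16x8([F16; 8]);

impl F16x8 {
    /// Returns the lane as a half-float, not as an f32.
    fn extract(self, lane: usize) -> F16 {
        self.0[lane]
    }
}
```

Keeping `extract` in the element type and making the widening explicit, as in `v.extract(0).to_f32()`, is one way to avoid the surprising `f32` return type mentioned above.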
@gnzlbg I wasn't aware of the situation on ARM.
Intel's recommendation on x86 is to only use half-floats to reduce memory bandwidth or improve space usage.
Any actual operation on them happens after you load them into `float` or `double` registers, which is why, besides the packing/unpacking features, there is no support for extracting certain values or anything like that.
It seems that ARM indeed has support for operating on the values, so there might be some more work involved there.
> Intel's recommendation on x86 is to only use half-floats to reduce memory bandwidth or improve space usage.
I think that information might be slightly outdated. AVX-512 still doesn't have any instructions to directly operate on single `f16`s, but AVX-512 4VNNIW (Vector Neural Network Instructions Word variable precision) adds some newer instructions for directly working on 16-bit float vectors.
Is this missing SSE4a intrinsics?
I have updated the parent post with:
- recently implemented SSE intrinsics (SSE is almost finished)
- list of MMX intrinsics (required to implement some of the missing SSE ones)
- list of SSE4a intrinsics
Alright, we're actually quite close to finishing this off! I've updated the lists above with up-to-date instructions as well as an updated list of remaining intrinsics (all the checked-off ones are removed for now).
Many of those functions are part of Intel's SVML, and don't map to a single instruction. Do we intend to link stdsimd against Intel's library for those? I also don't think we should include them in the SSE/AVX sections as they're listed above.
@AdamNiederer oh interesting! I'm actually not sure what that is (SVML), could you expand a bit on what that is?
I was noticing though that some of the trigonometry-related functions weren't defined in either clang/gcc, which means we probably shouldn't be doing it just yet!
@alexcrichton long story short:
> The Intel® C++ Compiler provides short vector math library (SVML) intrinsics to compute vector math functions. ... The SVML intrinsics do not have any corresponding instructions. The prototypes for the SVML intrinsics are available in the immintrin.h file.
The SVML is just a bunch of inlining-friendly assembly-level subroutines which use SSE/AVX instructions to compute higher-level mathematical primitives. I'm pretty sure it's "just another library", otherwise. It's heavily optimized for Intel CPUs, much like ICC. I'm also pretty sure it's not open-source or readily available.
@alexcrichton, SSE instructions are split into 3 folders: i586, i686 and x86_64. How should I know where to put an implementation for `_mm_log2_pd`, for example? It is not obvious to me.
@crypto-universe @AdamNiederer ok cool, thanks for the info! Sounds like I should omit those intrinsics. I've updated the OP to omit the SVML intrinsics.
@crypto-universe oh the division between those modules is somewhat non-important now. The main one is that `x86_64` is only compiled on 64-bit targets, but 32-bit targets compile both i586 and i686. If the intrinsic only works on x86_64 it should go there, otherwise either of the other modules is fine.
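The split described above can be pictured with conditional compilation. This is a toy sketch, not the crate's real module tree: the module contents and the `compiled_modules` helper are hypothetical, and it assumes that 64-bit targets also compile the `i586` and `i686` modules in addition to `x86_64`.

```rust
// Compiled on both 32-bit and 64-bit x86 targets (assumption for this sketch).
#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
mod i586 {
    pub fn name() -> &'static str { "i586" }
}

#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
mod i686 {
    pub fn name() -> &'static str { "i686" }
}

// Only compiled on 64-bit targets: intrinsics that need 64-bit
// registers or operands would live here.
#[cfg(target_arch = "x86_64")]
mod x86_64 {
    pub fn name() -> &'static str { "x86_64" }
}

/// Lists which of the three modules exist for the current target.
fn compiled_modules() -> Vec<&'static str> {
    #[allow(unused_mut)]
    let mut mods = Vec::new();
    #[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
    {
        mods.push(i586::name());
        mods.push(i686::name());
    }
    #[cfg(target_arch = "x86_64")]
    {
        mods.push(x86_64::name());
    }
    mods
}
```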
Ok I think this is effectively "done enough" that we can close and follow up with more specific issues if need be. Thanks so much for everyone's help on this!
Is this the right place to mention that core::arch is missing RISC-V support or should I open a tracking bug? (I'm specifically interested in adding support for the equivalent of rdtsc).
We generally try to stick to vendor-specified intrinsics, e.g. SSE intrinsics and ARM NEON intrinsics. AFAIK RISC-V doesn't have any target-specific intrinsics defined in GCC or Clang.
Ough. Thanks. I can see your reasoning, but that raises the bar by orders of magnitude and pushes the problem to all clients of core::arch :(
You can always just use inline assembly if you really want a specific instruction...
That's literally what "pushes the problem to all clients" means.
Probably best to just open a new issue where it can get eyes and discussion. The tail end of a long-closed issue isn't a good way to bring your problem to light.
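For context, the inline-assembly route mentioned above might look roughly like the following. This is a sketch under assumptions rather than a proposed API: the function name `read_timestamp` is made up, the x86_64 arm uses the existing `_rdtsc` intrinsic, and the RISC-V arm assumes the `rdtime` pseudo-instruction is readable from the current privilege level.

```rust
/// Reads a free-running hardware counter where one is known to this sketch.
#[cfg(target_arch = "x86_64")]
fn read_timestamp() -> Option<u64> {
    // SAFETY: `rdtsc` has no memory-safety preconditions.
    Some(unsafe { core::arch::x86_64::_rdtsc() })
}

/// RISC-V has no vendor intrinsic for this, so inline assembly is used.
#[cfg(target_arch = "riscv64")]
fn read_timestamp() -> Option<u64> {
    let t: u64;
    // SAFETY: reads the `time` CSR into a register and touches no memory.
    // Assumption: the platform allows user-mode reads of this counter.
    unsafe {
        core::arch::asm!("rdtime {0}", out(reg) t, options(nomem, nostack));
    }
    Some(t)
}

/// Fallback for targets this sketch does not cover.
#[cfg(not(any(target_arch = "x86_64", target_arch = "riscv64")))]
fn read_timestamp() -> Option<u64> {
    None
}
```

Every crate that wants such a counter would have to carry a block like this itself, which is the burden on clients raised earlier in the exchange.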
@Amanieu It doesn't look like there are any RISC-V intrinsics in llvm/clang yet, but there is some recent work in that area: https://www.sifive.com/blog/risc-v-vector-extension-intrinsic-support
Those are actually much trickier than it seems since they involve scalable vectors with a size not known at compile-time. This requires special support in the compiler. The same issue applies to the ARM SVE intrinsics.
Out of interest and because it has recently become relevant: ["VMX"] would be helpful.
Maybe it's worth opening a separate issue for each target feature? For example, I wanted to use `_mm_stream_load_si128` and was quite surprised that `std::arch::x86_64` does not have it.
Is there a reason why streaming load intrinsics were omitted?
Please open a new issue if there are any missing intrinsics.