Rotation code should use intrinsics on platforms other than MSVC
travisdowns opened this issue · 10 comments
As far as I can tell, compiler rotate intrinsics are only used on MSVC.
As mentioned here, however, most platforms offer those intrinsics at least for x86 in <x86intrin.h>, so it seems like they could be used when available.
IIRC x86intrin.h isn't really standard; probably better to use immintrin.h (as https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=rotl&expand=4431 specifies)… that way at least we can stick to architecture-specific ifdefs instead of also having to check the compiler.
Well, it is covered somewhat in the answer. None of them are standard by any stretch of the imagination, but it seems that x86intrin.h is available on gcc, clang and icc. immintrin.h OTOH only seems to have the rotate intrinsics on icc. That's my summary based on a cursory read, anyway.
Ugh. I'm pretty sure x86intrin.h isn't available on MSVC, probably other compilers as well. I guess we'll have to add checks on __GNUC__ and __INTEL_COMPILER.
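Roughly, I'd imagine something along these lines (just a sketch; the exact version checks still need to be worked out, and the header selection below is an assumption rather than what portable-snippets does today):

/* Sketch only: choose the rotate header per compiler rather than per
 * architecture.  Version checks would still be needed before relying on
 * any particular intrinsic from these headers. */
#if defined(_MSC_VER)
#  include <intrin.h>        /* MSVC: _rotl / _rotr */
#elif defined(__GNUC__) || defined(__INTEL_COMPILER)
#  if defined(__x86_64__) || defined(__i386__)
#    include <x86intrin.h>   /* gcc / icc (and clang) on x86 */
#  endif
#endif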
Yes, but that's the bread and butter of portable snippets, right? The work is done here so users don't have to do it :)
Sure, but it means it will take some time to develop.
Unfortunately it's not just a matter of choosing the right file name; we also have to figure out when each compiler added support for a particular function, and thanks to PSNIP_BUILTIN_EMULATE_NATIVE we can't just whitelist functions one at a time, since that may end up causing us to redefine an existing symbol. I'm starting to have serious regrets about that…
FWIW, while clang has x86intrin.h, there don't seem to be any MSVC-style intrinsics (like _rotl).
You are right that clang, at least up to the current version (4.0?), doesn't seem to have any rotate intrinsic. I had seen it reported that it did, but that must have been an error, or the intrinsic was included from somewhere else.
Here's a good summary of the current issue, which is that there is no C/C++ form of rotate that is both (a) free of UB and (b) recognized by both gcc and clang as a rotate (hence emitting ror or rol) without redundant code. Each compiler does, however, have at least one form that is UB-free and recognized; it's just that those sets don't overlap between the two compilers.
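For concreteness, this is where the UB comes from in the obvious formulation (rotl_naive is just an illustrative name):

/* Naive form: when b == 0 the right-shift count is 32, which is undefined
 * for a 32-bit unsigned (and both shifts are undefined when b >= 32).  The
 * usual repair is to mask both counts, e.g.
 * (a << (b & 31)) | (a >> (-b & 31)), but compilers differ on whether they
 * still recognize the repaired forms as a rotate. */
unsigned rotl_naive(unsigned a, unsigned b) {
  return (a << b) | (a >> (32 - b));
}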
Perhaps a simple solution is to use the clang-recognized form when clang is used and the gcc-recognized form when gcc is used (and to default to one of the two the rest of the time).
Or you could still use the intrinsic for gcc, which apparently started supporting it sometime after 4.4.7 and at or before 4.5.3. It's probably simple enough just to be conservative and only use it for 4.5.3 or later. Or, since the intrinsics seem to be declared as macros, you could just use an #ifdef to detect them? I suppose it is possible that they won't always be declared like that, though.
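If they really are macros, the detection could look something like this (untested sketch; whether _rotl is exposed as a macro on every relevant gcc version is an assumption, and EXAMPLE_ROTL32 is a placeholder name):

#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#  include <x86intrin.h>
#endif
#if defined(_rotl)
   /* gcc appears to define _rotl as a macro, so we can probe for it directly */
#  define EXAMPLE_ROTL32(v, s) _rotl((v), (s))
#else
   /* fall back to the shift-based idiom, with both counts masked to avoid UB */
#  define EXAMPLE_ROTL32(v, s) (((v) << ((s) & 31)) | ((v) >> ((0u - (s)) & 31)))
#endif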
I guess inline assembly for x86 on compilers other than GCC and MSVC would be best… clang supports ACLE, so for ARM on clang we can use __ror from ACLE 1.0 and __rorl/__rorll from 1.1. For a bit more fun, GCC doesn't support _rotl64/_rotr64, though they do have __rolq/__rorq.
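For the ARM side, maybe something like this (untested; that __ARM_ACLE implies arm_acle.h is available, and that __ror is the right spelling for the 32-bit rotate, are assumptions based on the ACLE docs; EXAMPLE_ROTR32 is a placeholder):

#if defined(__ARM_ACLE)
#  include <arm_acle.h>
   /* ACLE 1.0 provides __ror for 32-bit values; __rorl/__rorll arrive in 1.1 */
#  define EXAMPLE_ROTR32(v, s) __ror((v), (s))
#endif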
I just pushed a commit (6b94730 002250d) to the staging branch, how do you feel about that? I only did the right shift, but you should get the idea… ARM is untested; I'm not even sure if clang defines __ARM_ACLE.
Mind if I re-purpose this issue to apply to all the MSVC intrinsics? There are a bunch of other functions which would benefit from the same analysis… for example, implementing _BitScanForward using __builtin_ctz isn't exactly optimal on x86.
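(For that example, the emulation in question looks roughly like this; the zero check is the part the native bsf-based intrinsic gets for free via the flags, and example_BitScanForward is just an illustrative name:)

static unsigned char example_BitScanForward(unsigned long *index, unsigned long mask) {
  if (mask == 0)
    return 0;                  /* __builtin_ctzl is undefined for 0 */
  *index = (unsigned long) __builtin_ctzl(mask);
  return 1;                    /* MSVC's intrinsic reports a nonzero input this way */
}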
I think using the C rotate idiom that clang already recognizes is the best approach for clang:
unsigned rotl_c(unsigned a, unsigned b) {
  b &= 31;
  return (a << b) | (a >> (32 - b));
}

unsigned rotr_c(unsigned a, unsigned b) {
  b &= 31;
  return (a >> b) | (a << (32 - b));
}
These translate directly into unadorned rol and ror instructions on platforms that support them, as shown in godbolt. This has the big advantage of "just working" on pretty much every clang platform that supports a rotate instruction (since the detection is generic), without needing to write and test per-hardware asm solutions for every platform. It also serves as its own fallback: on platforms where there is no builtin-recognized rotate, you'll get some OK shift-based code.
The code generated for the standalone function is ideal in clang 3.5 and later:
rotl_c(unsigned int, unsigned int): # @rotl_c(unsigned int, unsigned int)
mov ecx, esi
rol edi, cl
mov eax, edi
ret
rotr_c(unsigned int, unsigned int): # @rotr_c(unsigned int, unsigned int)
mov ecx, esi
ror edi, cl
mov eax, edi
ret
The two movs are unavoidable anyway, given the operand restrictions of the rol and ror instructions.
Now, in clang prior to 3.5 (back to 3.0, the earliest I could test on godbolt) the idiom is still recognized, but the b &= 31 masking is not optimized away; this is going to be nearly free in most cases, though:
rotl_c(unsigned int, unsigned int): # @rotl_c(unsigned int, unsigned int)
and esi, 31
mov cl, sil
rol edi, cl
mov eax, edi
ret
rotr_c(unsigned int, unsigned int): # @rotr_c(unsigned int, unsigned int)
and esi, 31
mov cl, sil
ror edi, cl
mov eax, edi
ret
That's an extra cycle of latency on the rotate count (which is usually not the important dependency chain, as the count is often constant or available early), but otherwise still very fast.
In terms of code generation, inline asm has some issues (see point 6 here, for example). The inline asm version compiles to the same good code as the C idiom (and never has the redundant and) for the standalone function, but once you integrate it into a real function, things change. Check out this example, which just calls either the C or asm version to rotate every element by 13:
void rotate13_c(unsigned *data, size_t size) {
for (unsigned *p = data; p < data + size; p++) {
*p = rotl_c(*p, 13);
}
}
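(For reference, the asm variant being compared is along these lines; this is a reconstruction, and the exact constraints in the godbolt example may differ:)

/* Reconstructed inline-asm rotate: the "c" constraint forces the count into
 * cl, which is why the generated loop below loads 13 into cl explicitly. */
static inline unsigned rotl_asm(unsigned a, unsigned b) {
  __asm__ ("roll %%cl, %0" : "+r" (a) : "c" (b));
  return a;
}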
The semantics of the C version are understood, and clang is actually able to vectorize the entire loop (since SSE has no SIMD rotate instruction, it internally implements the rotate using shifts!). The asm version is opaque to the compiler, so no such magic is possible.
Let's turn off vectorization (aka magic) and just look at the non-vectorized versions:
Main C loop:
rol dword ptr [rdi], 13
rol dword ptr [rdi + 4], 13
rol dword ptr [rdi + 8], 13
rol dword ptr [rdi + 12], 13
add rdi, 16
cmp rdi, rax
jb .LBB2_5
The compiler understood the constant 13 rotate, and also the semantics of the operation and unrolled the loop 4 times. It is able to use a memory RMW to do 4 rotates in 7 instructions, with a very low amount of total overhead (perhaps 1.5 fused uops per rotate).
The asm loop:
.LBB3_2: # =>This Inner Loop Header: Depth=1
mov edx, dword ptr [rdi]
mov cl, 13
rol edx, cl
mov ecx, edx
mov dword ptr [rdi], ecx
add rdi, 4
cmp rdi, rax
jb .LBB3_2
It is 8 instructions and it does a single rotate per loop iteration. The compiler can still inline the function, but it otherwise does a bad job: it will never be able to use the "immediate" form of rol, and its register allocation is poor (why is it moving the result from edx into ecx?). It's something like 9 fused uops per rotate, and is likely to be at least twice as slow as the other option.
In fact, all of the above applies to gcc as well, which generates good code for this pattern. Older versions do have an extra and like clang, but newer versions don't (starting with 4.6.4 the and disappears). So it seems like this is a good baseline to put in as the fallback: it avoids undefined behavior and plays well with gcc and clang. It makes me question whether we even need the intrinsics...
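(Relatedly, since gcc lacks _rotl64/_rotr64, the same idiom extends to 64 bits. The sketch below masks both counts so it stays UB-free even when the count is a multiple of 64; how far back each compiler recognizes this exact 64-bit form is something I haven't checked:)

#include <stdint.h>

/* Sketch only, not from the staging branch: the same idiom widened to 64 bits,
 * with both shift counts masked so there is no UB when b % 64 == 0. */
static uint64_t rotl64_c(uint64_t a, unsigned b) {
  return (a << (b & 63)) | (a >> ((-b) & 63));
}

static uint64_t rotr64_c(uint64_t a, unsigned b) {
  return (a >> (b & 63)) | (a << ((-b) & 63));
}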
You are certainly welcome to use this for the "more intrinsics" bug too, although now we have a lot of rotate-specific stuff :p - much of the above doesn't apply to other intrinsics that don't have the "recognized idiom" property.
In C++20 there are std::rotl and std::rotr. Unfortunately, there's nothing like that in C.