#pragma float_control(precise, on) doesn't work for SSE intrinsics

Question

obfuscated opened this issue 2 years ago · 1 comment

This is the link to godbolt with the full reproducer: https://godbolt.org/z/qYczcba39

The problem is that the pragma doesn't switch the mode when using intrinsics directly, but works when using the operators for the __m128 types.

I originally discovered this in clang 14.0.1.

The code to see the problem is this (compiled with -Ofast -msse4.2 -mrecip=none):

__m128 func(__m128 d, float oldLen, float newLen) {
	#pragma float_control(precise, on)
	return _mm_div_ps(
		_mm_mul_ps(d, _mm_set1_ps(oldLen)),
		_mm_set1_ps(newLen)
	);
}

__m128 func1(__m128 d, float oldLen, float newLen) {
	#pragma float_control(precise, on)
	return d*oldLen/newLen;
}

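It produces the following assembly. Note that func computes 1/newLen with divss and then multiplies by it, while func1 keeps the divps: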

.LCPI1_0:
        .long   0x3f800000                      # float 1
func(float __vector(4), float, float):                         # @func(float __vector(4), float, float)
        shufps  xmm1, xmm1, 0                   # xmm1 = xmm1[0,0,0,0]
        mulps   xmm0, xmm1
        movss   xmm1, dword ptr [rip + .LCPI1_0] # xmm1 = mem[0],zero,zero,zero
        divss   xmm1, xmm2
        shufps  xmm1, xmm1, 0                   # xmm1 = xmm1[0,0,0,0]
        mulps   xmm0, xmm1
        ret
func1(float __vector(4), float, float):                        # @func1(float __vector(4), float, float)
        shufps  xmm1, xmm1, 0                   # xmm1 = xmm1[0,0,0,0]
        mulps   xmm0, xmm1
        shufps  xmm2, xmm2, 0                   # xmm2 = xmm2[0,0,0,0]
        divps   xmm0, xmm2
        ret

Generally, rewriting x/a as x*(1/a) seems questionable here, and clang doesn't do it for scalars, only for vector/SIMD types. Is this another bug that needs to be reported separately?
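
To illustrate why the reciprocal rewrite matters for precise mode, here is a minimal scalar sketch. It is not part of the reproducer: the function names are hypothetical, and it assumes IEEE 754 single-precision arithmetic and a build without fast-math flags. It only shows that dividing directly and multiplying by a precomputed reciprocal can round differently.

```cpp
// Hypothetical helpers for illustration only; not taken from the reproducer.
// volatile is used so the optimizer cannot fold or rewrite the arithmetic.

// Single correctly rounded division: x / a.
float divide_directly(float x, float a) {
    volatile float vx = x;
    volatile float va = a;
    volatile float q = vx / va;
    return q;
}

// Reciprocal form: 1/a is rounded first, then the product is rounded again.
float multiply_by_reciprocal(float x, float a) {
    volatile float vx = x;
    volatile float va = a;
    volatile float r = 1.0f / va;   // first rounding
    volatile float q = vx * r;      // second rounding
    return q;
}

// Count integer pairs (x, a) in [1, limit] for which the two forms disagree.
int count_mismatches(int limit) {
    int mismatches = 0;
    for (int a = 1; a <= limit; ++a) {
        for (int x = 1; x <= limit; ++x) {
            float fx = static_cast<float>(x);
            float fa = static_cast<float>(a);
            if (divide_directly(fx, fa) != multiply_by_reciprocal(fx, fa)) {
                ++mismatches;
            }
        }
    }
    return mismatches;
}
```

Under those assumptions, divide_directly(5.0f, 3.0f) and multiply_by_reciprocal(5.0f, 3.0f) differ in the last bit, because 1.0f/3.0f is already rounded before the multiply. Divisors that are powers of two, such as 2.0f, give identical results since their reciprocals are exact.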

Answer 1 · 2024-02-02T22:41:34.000Z

I'm still seeing this problem in clang v18.1.0 RC with the DirectXMath library. The only way to get my library to work with clang in Release mode is to not use /fp:fast.