#pragma float_control(precise, on) doesn't work for SSE intrinsics
obfuscated opened this issue · 1 comment
Here is a Godbolt link with the full reproducer: https://godbolt.org/z/qYczcba39
The problem is that the pragma does not switch the floating-point mode when the intrinsics are called directly, although it does work when the built-in operators on __m128 values are used.
I originally discovered this in clang 14.0.1.
The following code reproduces the problem (compiled with -Ofast -msse4.2 -mrecip=none):
#include <immintrin.h>

__m128 func(__m128 d, float oldLen, float newLen) {
#pragma float_control(precise, on)
    return _mm_div_ps(
        _mm_mul_ps(d, _mm_set1_ps(oldLen)),
        _mm_set1_ps(newLen)
    );
}

__m128 func1(__m128 d, float oldLen, float newLen) {
#pragma float_control(precise, on)
    return d*oldLen/newLen;
}
And it leads to this assembly:
.LCPI1_0:
        .long   0x3f800000                      # float 1
func(float __vector(4), float, float):          # @func(float __vector(4), float, float)
        shufps  xmm1, xmm1, 0                   # xmm1 = xmm1[0,0,0,0]
        mulps   xmm0, xmm1
        movss   xmm1, dword ptr [rip + .LCPI1_0] # xmm1 = mem[0],zero,zero,zero
        divss   xmm1, xmm2
        shufps  xmm1, xmm1, 0                   # xmm1 = xmm1[0,0,0,0]
        mulps   xmm0, xmm1
        ret
func1(float __vector(4), float, float):         # @func1(float __vector(4), float, float)
        shufps  xmm1, xmm1, 0                   # xmm1 = xmm1[0,0,0,0]
        mulps   xmm0, xmm1
        shufps  xmm2, xmm2, 0                   # xmm2 = xmm2[0,0,0,0]
        divps   xmm0, xmm2
        ret
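To make the difference between the two listings concrete, here is a portable C sketch that emulates each assembly sequence lane by lane on a plain 4-float array, so no SSE is needed. The vec4 type and the emulate_* function names are hypothetical, chosen only for illustration. In the listing, func computes a scalar reciprocal with divss and multiplies by it, while func1 performs a true divps.

```c
#include <stdbool.h>

/* Hypothetical stand-in for __m128: four single-precision lanes. */
typedef struct { float v[4]; } vec4;

/* Mirrors the func() listing: mulps, then divss 1/newLen, broadcast, mulps. */
static vec4 emulate_func(vec4 d, float oldLen, float newLen) {
    vec4 r;
    float recip = 1.0f / newLen;          /* divss: reciprocal computed once */
    for (int i = 0; i < 4; i++) {
        float prod = d.v[i] * oldLen;     /* mulps with broadcast oldLen */
        r.v[i] = prod * recip;            /* mulps with broadcast 1/newLen */
    }
    return r;
}

/* Mirrors the func1() listing: mulps, then a true divps by newLen. */
static vec4 emulate_func1(vec4 d, float oldLen, float newLen) {
    vec4 r;
    for (int i = 0; i < 4; i++) {
        float prod = d.v[i] * oldLen;     /* mulps with broadcast oldLen */
        r.v[i] = prod / newLen;           /* divps: correctly rounded quotient */
    }
    return r;
}

/* Returns true if every lane of a compares equal to the same lane of b. */
static bool vec4_equal(vec4 a, vec4 b) {
    for (int i = 0; i < 4; i++)
        if (a.v[i] != b.v[i]) return false;
    return true;
}
```

Assuming float arithmetic is evaluated in single precision (FLT_EVAL_METHOD == 0, as on x86-64 SSE) and the sketch itself is built without fast-math, d = {5, 5, 5, 5}, oldLen = 1, newLen = 3 gives different results: 5/3 rounds to 1.66666663, while 5 × fl(1/3) rounds to 1.66666675. With a power-of-two newLen such as 2, both agree because the reciprocal is then exact.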
More generally, rewriting the division as a multiplication by (1/a) seems questionable here, and clang does not do it for scalars, only for vector/SIMD types. Is this a separate bug that should be reported on its own?
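A quick way to see why the *(1/a) rewrite is not value-preserving is a brute-force check over small integer operands. The sketch below compares the correctly rounded quotient x / y with x * (1 / y), which is the rewrite that fast-math's reciprocal permission allows (LLVM calls this fast-math flag arcp). The function name is hypothetical, and the code must itself be compiled without fast-math, or the compiler may apply the same rewrite to it.

```c
/* Counts integer pairs (x, y) in [1, n] for which the correctly rounded
 * quotient x / y differs from x * (1 / y) in single precision. */
static long count_reciprocal_mismatches(int n) {
    long mismatches = 0;
    for (int y = 1; y <= n; y++) {
        float recip = 1.0f / (float)y;              /* rounded reciprocal */
        for (int x = 1; x <= n; x++) {
            float quotient = (float)x / (float)y;   /* precise-mode result */
            float rewritten = (float)x * recip;     /* result after *(1/a) */
            if (quotient != rewritten) mismatches++;
        }
    }
    return mismatches;
}
```

Even count_reciprocal_mismatches(5) is non-zero on an IEEE 754 single-precision target, because 5/3 and 5 * (1/3) round to different floats. That is a one-ulp difference that float_control(precise, on) is supposed to rule out.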
I'm still seeing this problem in clang v18.1.0 RC with the DirectXMath library. The only way to get my library working with clang in Release mode is to NOT use /fp:fast.