Tiered miscompilation of Vector2.Dot(x, x) without SSE4.1
PJB3005 opened this issue · 3 comments
Description
The following code miscompiles in tier1, only if SSE4.1 is not available (DOTNET_EnableSSE41=0
):
[MethodImpl(MethodImplOptions.NoInlining)]
static Vector2 Asdf2(Vector2 totalForce)
{
return Vector2.Dot(totalForce, totalForce) * new Vector2(1, 1);
}
Reproduction Steps
Run this code in release with DOTNET_EnableSSE41=0
:
using System.Numerics;
using System.Runtime.CompilerServices;
public static class Program
{
public static void Main()
{
Vector2? result = null;
while (true)
{
var got = Asdf2(new Vector2(2, 3));
result ??= got;
Console.WriteLine(got);
if (got != result)
throw new InvalidOperationException();
}
}
[MethodImpl(MethodImplOptions.NoInlining)]
static Vector2 Asdf2(Vector2 totalForce)
{
return Vector2.Dot(totalForce, totalForce) * new Vector2(1, 1);
}
}
Expected behavior
Program spins forever printing <13, 13>
Actual behavior
Tiered compilation kicks in and the result changes to <13, 0>
, tripping the exception.
Generated code:
; Assembly listing for method Program:Asdf2(System.Numerics.Vector2):System.Numerics.Vector2 (Tier1)
; Emitting BLENDED_CODE for generic X64 - Windows
; Tier1 code
; optimized code
; rsp based frame
; partially interruptible
; No PGO data
G_M000_IG01: ;; offset=0x0000
movd xmm0, rcx
G_M000_IG02: ;; offset=0x0005
mulps xmm0, xmm0
haddps xmm0, xmm0
movd rax, xmm0
G_M000_IG03: ;; offset=0x0011
ret
; Total bytes of code 18
Regression?
Worked in .NET 7
Known Workarounds
Wrap the Vector2.Dot
call itself in NoInlining
.
Configuration
.NET 8: 8.0.1
OS: Windows 11 and Linux
Architecture: x64
Other information
Discovered through space-wizards/space-station-14#24008
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.
Issue Details
Description
The following code miscompiles in tier1, only if SSE4.1 is not available (DOTNET_EnableSSE41=0
):
[MethodImpl(MethodImplOptions.NoInlining)]
static Vector2 Asdf2(Vector2 totalForce)
{
return Vector2.Dot(totalForce, totalForce) * new Vector2(1, 1);
}
Reproduction Steps
Run this code in release with DOTNET_EnableSSE41=0
:
using System.Numerics;
using System.Runtime.CompilerServices;
public static class Program
{
public static void Main()
{
Vector2? result = null;
while (true)
{
var got = Asdf2(new Vector2(2, 3));
result ??= got;
Console.WriteLine(got);
if (got != result)
throw new InvalidOperationException();
}
}
[MethodImpl(MethodImplOptions.NoInlining)]
static Vector2 Asdf2(Vector2 totalForce)
{
return Vector2.Dot(totalForce, totalForce) * new Vector2(1, 1);
}
}
Expected behavior
Program spins forever printing <13, 13>
Actual behavior
Tiered compilation kicks in and the result changes to <13, 0>
, tripping the exception.
Generated code:
; Assembly listing for method Program:Asdf2(System.Numerics.Vector2):System.Numerics.Vector2 (Tier1)
; Emitting BLENDED_CODE for generic X64 - Windows
; Tier1 code
; optimized code
; rsp based frame
; partially interruptible
; No PGO data
G_M000_IG01: ;; offset=0x0000
movd xmm0, rcx
G_M000_IG02: ;; offset=0x0005
mulps xmm0, xmm0
haddps xmm0, xmm0
movd rax, xmm0
G_M000_IG03: ;; offset=0x0011
ret
; Total bytes of code 18
### Regression?
Worked in .NET 7
### Known Workarounds
Wrap the `Vector2.Dot` call itself in `NoInlining`.
### Configuration
.NET 8: 8.0.1
OS: Windows 11 and Linux
Architecture: x64
### Other information
Discovered through https://github.com/space-wizards/space-station-14/pull/24008
<table>
<tr>
<th align="left">Author:</th>
<td>PJB3005</td>
</tr>
<tr>
<th align="left">Assignees:</th>
<td>-</td>
</tr>
<tr>
<th align="left">Labels:</th>
<td>
`area-CodeGen-coreclr`, `untriaged`
</td>
</tr>
<tr>
<th align="left">Milestone:</th>
<td>-</td>
</tr>
</table>
</details>
It looks like the vector multiplication meant to occur after the dot product has somehow disappeared at that step of compilation. JIT expects hadd to broadcast(?)
Looks like I introduced the bug in #81335
There is an optimization here that looks for Create(Dot(..., ...))
and Create(Sqrt(Dot(..., ...))
since we normally ensure the result is broadcast across the whole vector: https://github.com/dotnet/runtime/pull/81335/files#diff-5b83397bbbdd17bb9457998b520fdaaa474d165390985b66f32371561b6d0bac
This works fine for Vector3/4, but for Vector2 when we can't use dpps
it breaks down since we only do a single haddps
. We therefore need to preserve the Create
in the case SSE4.1 isn't available to ensure that we still broadcast the result to all elements.
#95568 is updating the Sum
implementation to no longer use haddps
and Dot
is itself effectively Sum(Mul(x, y))
, so we may want to simplify the logic here after that goes in and just implement it as Sum(Mul(x, y))
and have the pattern recognition in morph
look for Create(Sqrt(Sum(x)))
and Create(Sum(x))
instead. That should improve the perf and also fix the issue.
The backport can do the simpler fix of just avoiding the optimization in morph
when SSE4.1 isn't supported and we can't guarantee the broadcast is happening.