dotnet/runtime

Tiered miscompilation of Vector2.Dot(x, x) without SSE4.1

PJB3005 opened this issue · 3 comments

Description

The following code miscompiles in tier1, only if SSE4.1 is not available (DOTNET_EnableSSE41=0):

    [MethodImpl(MethodImplOptions.NoInlining)]
    static Vector2 Asdf2(Vector2 totalForce)
    {
        return Vector2.Dot(totalForce, totalForce) * new Vector2(1, 1);
    }

Reproduction Steps

Run this code in release with DOTNET_EnableSSE41=0:

using System.Numerics;
using System.Runtime.CompilerServices;

public static class Program
{
    public static void Main()
    {
        Vector2? result = null;
        while (true)
        {
            var got = Asdf2(new Vector2(2, 3));
            result ??= got;
            Console.WriteLine(got);
            if (got != result)
                throw new InvalidOperationException();
        }
    }

    [MethodImpl(MethodImplOptions.NoInlining)]
    static Vector2 Asdf2(Vector2 totalForce)
    {
        return Vector2.Dot(totalForce, totalForce) * new Vector2(1, 1);
    }
}

Expected behavior

Program spins forever printing <13, 13>

Actual behavior

Tiered compilation kicks in and the result changes to <13, 0>, tripping the exception.

Generated code:

; Assembly listing for method Program:Asdf2(System.Numerics.Vector2):System.Numerics.Vector2 (Tier1)
; Emitting BLENDED_CODE for generic X64 - Windows
; Tier1 code
; optimized code
; rsp based frame
; partially interruptible
; No PGO data

G_M000_IG01:                ;; offset=0x0000
       movd     xmm0, rcx

G_M000_IG02:                ;; offset=0x0005
       mulps    xmm0, xmm0
       haddps   xmm0, xmm0
       movd     rax, xmm0
 
G_M000_IG03:                ;; offset=0x0011
       ret

; Total bytes of code 18

Regression?

Worked in .NET 7

Known Workarounds

Wrap the Vector2.Dot call itself in NoInlining.

Configuration

.NET 8: 8.0.1
OS: Windows 11 and Linux
Architecture: x64

Other information

Discovered through space-wizards/space-station-14#24008

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

Issue Details

Description

The following code miscompiles in tier1, only if SSE4.1 is not available (DOTNET_EnableSSE41=0):

    [MethodImpl(MethodImplOptions.NoInlining)]
    static Vector2 Asdf2(Vector2 totalForce)
    {
        return Vector2.Dot(totalForce, totalForce) * new Vector2(1, 1);
    }

Reproduction Steps

Run this code in release with DOTNET_EnableSSE41=0:

using System.Numerics;
using System.Runtime.CompilerServices;

public static class Program
{
    public static void Main()
    {
        Vector2? result = null;
        while (true)
        {
            var got = Asdf2(new Vector2(2, 3));
            result ??= got;
            Console.WriteLine(got);
            if (got != result)
                throw new InvalidOperationException();
        }
    }

    [MethodImpl(MethodImplOptions.NoInlining)]
    static Vector2 Asdf2(Vector2 totalForce)
    {
        return Vector2.Dot(totalForce, totalForce) * new Vector2(1, 1);
    }
}

Expected behavior

Program spins forever printing <13, 13>

Actual behavior

Tiered compilation kicks in and the result changes to <13, 0>, tripping the exception.

Generated code:

; Assembly listing for method Program:Asdf2(System.Numerics.Vector2):System.Numerics.Vector2 (Tier1)
; Emitting BLENDED_CODE for generic X64 - Windows
; Tier1 code
; optimized code
; rsp based frame
; partially interruptible
; No PGO data

G_M000_IG01:                ;; offset=0x0000
       movd     xmm0, rcx

G_M000_IG02:                ;; offset=0x0005
       mulps    xmm0, xmm0
       haddps   xmm0, xmm0
       movd     rax, xmm0
 
G_M000_IG03:                ;; offset=0x0011
       ret

; Total bytes of code 18

### Regression?

Worked in .NET 7

### Known Workarounds

Wrap the `Vector2.Dot` call itself in `NoInlining`.

### Configuration

.NET 8: 8.0.1
OS: Windows 11 and Linux
Architecture: x64


### Other information

Discovered through https://github.com/space-wizards/space-station-14/pull/24008

<table>
  <tr>
    <th align="left">Author:</th>
    <td>PJB3005</td>
  </tr>
  <tr>
    <th align="left">Assignees:</th>
    <td>-</td>
  </tr>
  <tr>
    <th align="left">Labels:</th>
    <td>

`area-CodeGen-coreclr`, `untriaged`

</td>
  </tr>
  <tr>
    <th align="left">Milestone:</th>
    <td>-</td>
  </tr>
</table>
</details>

It looks like the vector multiplication meant to occur after the dot product has somehow disappeared at that step of compilation. JIT expects hadd to broadcast(?)

Looks like I introduced the bug in #81335

There is an optimization here that looks for Create(Dot(..., ...)) and Create(Sqrt(Dot(..., ...)) since we normally ensure the result is broadcast across the whole vector: https://github.com/dotnet/runtime/pull/81335/files#diff-5b83397bbbdd17bb9457998b520fdaaa474d165390985b66f32371561b6d0bac

This works fine for Vector3/4, but for Vector2 when we can't use dpps it breaks down since we only do a single haddps. We therefore need to preserve the Create in the case SSE4.1 isn't available to ensure that we still broadcast the result to all elements.

#95568 is updating the Sum implementation to no longer use haddps and Dot is itself effectively Sum(Mul(x, y)), so we may want to simplify the logic here after that goes in and just implement it as Sum(Mul(x, y)) and have the pattern recognition in morph look for Create(Sqrt(Sum(x))) and Create(Sum(x)) instead. That should improve the perf and also fix the issue.

The backport can do the simpler fix of just avoiding the optimization in morph when SSE4.1 isn't supported and we can't guarantee the broadcast is happening.