dotnet/runtime

Consider expanding `Vector<T>` to support `nint` and `nuint`

tannergooding opened this issue ยท 30 comments

Rationale

Today Vector<T> supports the following 10 primitive types: byte, sbyte, short, ushort, int, uint, long, ulong, float, and double.

C# 9 is introducing support for nint and nuint which are variable sized integers matching the bitness of the underlying platform. That is, they are 32-bits on 32-bit systems and 64-bits on 64-bit systems.

As such, it may be beneficial to expand Vector<T> to additionally support these types so we can get rid of the using aliases and support performing the cross-platform vector operations using these new primitive types.

Proposal

Extend Vector<T> to support nint and nuint as valid primitive types. This will extend a number of existing generic functions which take a Vector<T> to also support taking the new types rather than throwing a PlatformNotSupportedException.

Additionally, the following non-generic APIs should be added for parity with the existing surface area:

namespace System.Numerics
{
    public partial struct Vector<T>
    {
        public static explicit operator Vector<nint>(Vector<T> value);
        public static explicit operator Vector<nuint>(Vector<T> value);
    }

    public static partial class Vector
    {
        public static Vector<nint> AsVectorIntPtr<T>(Vector<T> value);
        public static Vector<nuint> AsVectorUIntPtr<T>(Vector<T> value);

        public static Vector<nint> Equals(Vector<nint> left, Vector<nint> right);
        public static Vector<nuint> Equals(Vector<nuint> left, Vector<nuint> right);

        public static Vector<nint> GreaterThan(Vector<nint> left, Vector<nint> right);
        public static Vector<nuint> GreaterThan(Vector<nuint> left, Vector<nuint> right);

        public static Vector<nint> GreaterThanOrEqual(Vector<nint> left, Vector<nint> right);
        public static Vector<nuint> GreaterThanOrEqual(Vector<nuint> left, Vector<nuint> right);

        public static Vector<nint> LessThan(Vector<nint> left, Vector<nint> right);
        public static Vector<nuint> LessThan(Vector<nuint> left, Vector<nuint> right);

        public static Vector<nint> LessThanOrEqual(Vector<nint> left, Vector<nint> right);
        public static Vector<nuint> LessThanOrEqual(Vector<nuint> left, Vector<nuint> right);
    }
}

Other Considerations

For API names, we have a guideline that states to use the framework name rather than the language keyword name (e.g. Int32 and Int64 rather than int and long). However, the framework name for nint is IntPtr but the operators exposed and general use case for the two types is somewhat different, as such a name such as NInt and NUInt may be a better alternative. Whatever we choose, it should likely become the standard for nint moving forward.

The same request could be made for System.Runtime.Intrinsics, but the API bloat for this would be much larger and would need further consideration. It might be worthwhile to allow nint/nuint as valid T without exposing the additional overloads initially as that would at least unblock users from providing such APIs themselves.

Tagging subscribers to this area: @tannergooding
Notify danmosemsft if you want to be subscribed.

CC. @pgovind, @CarolEidt, @echesakovMSFT

We have an existing use case in the runtime in Utf16Utility.Validation.cs where we are using the following to workaround not having this:

#if TARGET_64BIT
using nuint_t = System.UInt64;
#else
using nuint_t = System.UInt32;
#endif

Could you also support any reference types?
This could be so much powerful.(e.g. Collection look up)
Ofcource, we could not cast other type vectors to reference type vector.

Could you also support any reference types?

As I understand it, that would be a non-trivial ask and likely not for a lot of benefit. Many of the operations exposed aren't valid or don't make sense for reference types and the JIT doesn't support tracking GC types in the SIMD registers.
There are likely other considerations I'm not aware of as well and which would best be answered by someone like @jkotas or @Maoni0.

I wonder if we could optimistically handle fast collection lookup anyway, even without support for reference types in SIMD. Consider that you have a T[] (where T is a reference type), and you want to quickly look up the index of any given T within the array. Assume referential equality checks, not deep equality checks.

The algorithm would then be as follows:

  1. Pin the T instance that represents your search target, then project this T to a void* then to nuint.
  2. Project the underlying T[] as a Span<nuint>.
  3. Perform a vectorized search within the Span<nuint> looking for the "target" nuint.

Since the target is pinned, it cannot be moved by the GC, so you don't have to worry about having the GC track individual elements within the SIMD registers. It's possible that the GC might move other elements of the array while you're inspecting it, but that's ok since neither their original addresses nor their modified addresses will match the pinned address of the target you're seeking. This means you're guaranteed zero false positives and zero false negatives.

Again, this would only work for referential equality checks. Since I'm not sure how common referential vs. deep equality checks are I'm not sure how much benefit this would offer in practice.

BTW,maybe we also want support for enum.

  1. Pin the T instance that represents your search target, then project this T to a void* then to nuint.

Maybe we need some compiler specific support or JIT intrinsic for done this effectively.
Such as

IntPtr e0,e1,e2,e3;//fixed-pointers
Unsafe.WriteUnaligned(ref Unsafe.As<IntPtr,byte>(ref e0),Unsafe.ReadUnaligned<Vector<nint>>(ref Unsafe.As<T,byte>(ref array[0])));
var vector = Unsafe.ReadUnaligned<Vector<nint>>(ref e0);

C# going to have record type feature.
So compiler can provide detailed,and structured information about it's equality.
If Vector<object> is supported (or work around with nint works), we could compare deep-equality by SIMD.

This is a pseudo code.

[EquatabilityContract]
[CompilerGenerated]
class Record
{
   [EquatabilityMember]
   long id;
}
Record[] records;
Record Find(long id)
{
   for(var i = 0;i + Vector256<object>.Count <records.Length;i+=Vector256<object>.Count)
   {
      var vector = Unsafe.ReadUnaligned<Vector256<object>>(ref records[i]);
      var ids = Avx2.GatherVector256<long>(JitHelper.OffsetOf<Record>(tokenof(id)),vector);//"tokenof" means member definition reference in IL.Compiler could generate it
      var compare = Avx2.CompareEqual(vector,Vector256.Create(id));
      var mm = Avx2.MoveMask(compare.AsBytes());
      if(mm!=0)
      {
         return Unsafe.As<Record>(vector.GetElement(mm/IntPtr.Size));
      }
   }
}

Video

  • We should use NInt and NUInt as the the non-keyword types names for APIs that have to refer to nint and nuint, as opposed to IntPtr and UIntPtr,
namespace System.Numerics
{
    public partial struct Vector<T>
    {
        public static explicit operator Vector<nint>(Vector<T> value);
        public static explicit operator Vector<nuint>(Vector<T> value);
    }

    public static partial class Vector
    {
        public static Vector<nint> AsVectorNInt<T>(Vector<T> value);
        public static Vector<nuint> AsVectorNUInt<T>(Vector<T> value);

        public static Vector<nint> Equals(Vector<nint> left, Vector<nint> right);
        public static Vector<nuint> Equals(Vector<nuint> left, Vector<nuint> right);

        public static Vector<nint> GreaterThan(Vector<nint> left, Vector<nint> right);
        public static Vector<nuint> GreaterThan(Vector<nuint> left, Vector<nuint> right);

        public static Vector<nint> GreaterThanOrEqual(Vector<nint> left, Vector<nint> right);
        public static Vector<nuint> GreaterThanOrEqual(Vector<nuint> left, Vector<nuint> right);

        public static Vector<nint> LessThan(Vector<nint> left, Vector<nint> right);
        public static Vector<nuint> LessThan(Vector<nuint> left, Vector<nuint> right);

        public static Vector<nint> LessThanOrEqual(Vector<nint> left, Vector<nint> right);
        public static Vector<nuint> LessThanOrEqual(Vector<nuint> left, Vector<nuint> right);
    }
}

@jkotas, @CarolEidt: How important is maintaining perf of Vector<T> for platforms without acceleration (such as ARM32)?

Today, much of the code is generated via T4 templates and internally uses the Register struct, which is explicit layout with overlapping data.

However, given the existence of System.Runtime.CompilerServices.Unsafe we could remove the T4 templates and rewrite the software fallback logic to be generic loops instead which would greatly simplify the code and make it easier to insert the nint and nuint support, which doesn't work as nicely with the T4 setup.

  • The T4 templates also rarely work with the current S.P.Corelib setup and so the benefit of it is basically non-existent today

This would also remove duplication between the IsHardwareAccelerated and fallback paths, where the former uses for loops and the latter is manually unrolled.

  • The IsHardwareAccelerated path is relying on the basic loop unrolling support that exists for Vector<T>.Count, which we could enable for ARM32 if perf is a concern.

Do you have any numbers for how much regression we would potentially see on ARM32?

It would impact Mono target platforms too. @marek-safar How much do we care about non-accelerated Vector<T> performance on Mono?

Do you have any numbers for how much regression we would potentially see on ARM32?

I don't have any currently but I can try and get some.

I imagine some parts would be improved, especially in user code, due to less data being zeroed (its currently a 16-byte, 66 field explicit layout struct). But some parts, namely the methods in S.P.Corelib, would likely regress due to running a for loop, rather than a manually unrolled loop (although several functions aren't unrolled today).

How much do we care about non-accelerated Vector performance on Mono?

@jkotas we don't care about performance for that config

Do you have any numbers for how much regression we would potentially see on ARM32?

@jkotas, it looks like the worst case is about a 4x* perf regression (for byte which has 16 elements). This only looks to be the case on ARM32 where the loop unrolling support around Vector<T>.Count doesn't exist.

  • Some, like LessThanOrEqualAll is actually closer to 6x, but this is because it does LessThan || Equals, rather than doing an efficient "single pass".

BenchmarkDotNet=v0.12.1.1405-nightly, OS=Windows 10.0.19041.508 (2004/May2020Update/20H1)
Microsoft SQ1 3.0 GHz, 1 CPU, 8 logical and 8 physical cores

Current:

Method Mean Error StdDev Median Min Max Gen 0 Gen 1 Gen 2 Allocated
CountBenchmark 0.0000 ns 0.0000 ns 0.0000 ns 0.0000 ns 0.0000 ns 0.0000 ns - - - -
OneBenchmark 11.6882 ns 0.0298 ns 0.0264 ns 11.6883 ns 11.6550 ns 11.7447 ns - - - -
ZeroBenchmark 0.0000 ns 0.0000 ns 0.0000 ns 0.0000 ns 0.0000 ns 0.0000 ns - - - -
EqualsBenchmark 3.7958 ns 0.0291 ns 0.0258 ns 3.8016 ns 3.7472 ns 3.8370 ns - - - -
GetHashCodeBenchmark 99.3895 ns 1.4870 ns 1.2417 ns 99.2845 ns 97.4959 ns 101.7576 ns - - - -
AddOperatorBenchmark 14.7578 ns 0.0843 ns 0.0748 ns 14.7397 ns 14.6672 ns 14.9465 ns - - - -
BitwiseAndOperatorBenchmark 12.2149 ns 0.0902 ns 0.0800 ns 12.2393 ns 12.0871 ns 12.3594 ns - - - -
BitwiseOrOperatorBenchmark 9.1529 ns 0.0340 ns 0.0302 ns 9.1471 ns 9.1069 ns 9.2057 ns - - - -
DivisionOperatorBenchmark 79.2174 ns 0.0941 ns 0.0786 ns 79.1944 ns 79.1173 ns 79.3931 ns - - - -
EqualityOperatorBenchmark 0.7216 ns 0.0280 ns 0.0248 ns 0.7208 ns 0.6867 ns 0.7715 ns - - - -
ExclusiveOrOperatorBenchmark 9.1752 ns 0.0468 ns 0.0438 ns 9.1527 ns 9.1113 ns 9.2425 ns - - - -
InequalityOperatorBenchmark 3.9465 ns 0.0683 ns 0.0605 ns 3.9228 ns 3.8851 ns 4.0766 ns - - - -
MultiplyOperatorBenchmark 15.1731 ns 0.0715 ns 0.0633 ns 15.1714 ns 15.0914 ns 15.3134 ns - - - -
OnesComplementOperatorBenchmark 21.1975 ns 0.0804 ns 0.0671 ns 21.2021 ns 21.0049 ns 21.2608 ns - - - -
SubtractionOperatorBenchmark 15.2344 ns 0.0691 ns 0.0613 ns 15.2315 ns 15.1622 ns 15.3523 ns - - - -
UnaryNegateOperatorBenchmark 17.3482 ns 0.0274 ns 0.0243 ns 17.3372 ns 17.3272 ns 17.4067 ns - - - -
AbsBenchmark 4.8659 ns 0.0293 ns 0.0259 ns 4.8537 ns 4.8366 ns 4.9118 ns - - - -
AddBenchmark 18.1764 ns 0.0569 ns 0.0504 ns 18.1810 ns 18.1020 ns 18.2635 ns - - - -
AndNotBenchmark 29.3541 ns 0.1666 ns 0.1558 ns 29.3400 ns 29.1411 ns 29.6693 ns - - - -
BitwiseAndBenchmark 10.0123 ns 0.0400 ns 0.0355 ns 10.0063 ns 9.9612 ns 10.0709 ns - - - -
BitwiseOrBenchmark 10.0021 ns 0.1060 ns 0.0991 ns 9.9950 ns 9.8593 ns 10.1830 ns - - - -
ConditionalSelectBenchmark 49.0080 ns 0.3078 ns 0.2879 ns 49.0135 ns 48.4501 ns 49.5042 ns - - - -
DivideBenchmark 78.8175 ns 0.2677 ns 0.2373 ns 78.7779 ns 78.4765 ns 79.2487 ns - - - -
DotBenchmark 15.2722 ns 0.0261 ns 0.0231 ns 15.2709 ns 15.2363 ns 15.3179 ns - - - -
EqualsStaticBenchmark 22.6560 ns 0.1268 ns 0.1186 ns 22.6159 ns 22.5144 ns 22.9051 ns - - - -
EqualsAllBenchmark 4.1479 ns 0.0527 ns 0.0467 ns 4.1447 ns 4.0745 ns 4.2333 ns - - - -
EqualsAnyBenchmark 29.4765 ns 0.0926 ns 0.0866 ns 29.4362 ns 29.3856 ns 29.6265 ns - - - -
GreaterThanBenchmark 21.7368 ns 0.0631 ns 0.0559 ns 21.7294 ns 21.6671 ns 21.8384 ns - - - -
GreaterThanAllBenchmark 32.9013 ns 0.0527 ns 0.0493 ns 32.9126 ns 32.8253 ns 32.9649 ns - - - -
GreaterThanAnyBenchmark 28.0047 ns 0.0764 ns 0.0714 ns 27.9833 ns 27.9141 ns 28.1723 ns - - - -
GreaterThanOrEqualBenchmark 52.3921 ns 0.3169 ns 0.2646 ns 52.2974 ns 52.1671 ns 53.0578 ns - - - -
GreaterThanOrEqualAllBenchmark 60.9816 ns 0.1275 ns 0.1065 ns 60.9825 ns 60.7853 ns 61.2152 ns - - - -
GreaterThanOrEqualAnyBenchmark 58.1081 ns 0.1660 ns 0.1386 ns 58.0799 ns 57.8858 ns 58.3585 ns - - - -
LessThanBenchmark 20.2578 ns 0.1129 ns 0.1000 ns 20.2367 ns 20.0859 ns 20.4389 ns - - - -
LessThanAllBenchmark 33.3310 ns 0.1187 ns 0.0991 ns 33.3558 ns 33.1460 ns 33.5284 ns - - - -
LessThanAnyBenchmark 25.5108 ns 0.0980 ns 0.0869 ns 25.5065 ns 25.3958 ns 25.6639 ns - - - -
LessThanOrEqualBenchmark 49.8892 ns 0.1327 ns 0.1176 ns 49.8391 ns 49.7525 ns 50.1483 ns - - - -
LessThanOrEqualAllBenchmark 60.0621 ns 0.1442 ns 0.1349 ns 60.0328 ns 59.8471 ns 60.2880 ns - - - -
LessThanOrEqualAnyBenchmark 55.2470 ns 0.0496 ns 0.0388 ns 55.2529 ns 55.1691 ns 55.3128 ns - - - -
MaxBenchmark 21.8661 ns 0.0363 ns 0.0322 ns 21.8606 ns 21.8211 ns 21.9399 ns - - - -
MinBenchmark 21.0789 ns 0.0779 ns 0.0691 ns 21.0662 ns 20.9944 ns 21.2350 ns - - - -
MultiplyBenchmark 18.0297 ns 0.0527 ns 0.0467 ns 18.0289 ns 17.9634 ns 18.1398 ns - - - -
NegateBenchmark 18.5310 ns 0.0626 ns 0.0555 ns 18.5090 ns 18.4585 ns 18.6348 ns - - - -
OnesComplementBenchmark 20.7130 ns 0.1875 ns 0.1754 ns 20.7374 ns 20.3737 ns 20.9449 ns - - - -
SquareRootBenchmark 34.9111 ns 0.0887 ns 0.0786 ns 34.8872 ns 34.8279 ns 35.0818 ns - - - -
SubtractBenchmark 18.0638 ns 0.0379 ns 0.0336 ns 18.0603 ns 17.9950 ns 18.1135 ns - - - -
XorBenchmark 10.2635 ns 0.1394 ns 0.1304 ns 10.2707 ns 10.0667 ns 10.5332 ns - - - -

Generic Unsafe implementation:
// * Summary *

BenchmarkDotNet=v0.12.1.1405-nightly, OS=Windows 10.0.19041.508 (2004/May2020Update/20H1)
Microsoft SQ1 3.0 GHz, 1 CPU, 8 logical and 8 physical cores
.NET Core SDK=6.0.100-alpha.1.20459.9
[Host] : .NET Core 6.0.0 (CoreCLR 6.0.20.45424, CoreFX 6.0.20.45424), Arm RyuJIT
Job-XVEAND : .NET Core 5.0 (CoreCLR 42.42.42.42424, CoreFX 42.42.42.42424), Arm RyuJIT

PowerPlanMode=00000000-0000-0000-0000-000000000000 Arguments=/p:DebugType=portable Toolchain=CoreRun
IterationTime=250.0000 ms MaxIterationCount=20 MinIterationCount=15
WarmupCount=1

Method Mean Error StdDev Median Min Max Gen 0 Gen 1 Gen 2 Allocated
CountBenchmark 0.6085 ns 0.0292 ns 0.0259 ns 0.6163 ns 0.5516 ns 0.6352 ns - - - -
OneBenchmark 15.0520 ns 0.0734 ns 0.0687 ns 15.0311 ns 14.9681 ns 15.1906 ns - - - -
ZeroBenchmark 0.7316 ns 0.1004 ns 0.1156 ns 0.7163 ns 0.5323 ns 0.9691 ns - - - -
EqualsBenchmark 4.0958 ns 0.0461 ns 0.0408 ns 4.0859 ns 4.0474 ns 4.1932 ns - - - -
GetHashCodeBenchmark 308.9952 ns 2.5950 ns 2.3004 ns 308.3442 ns 306.2050 ns 314.2366 ns - - - -
AddOperatorBenchmark 61.4049 ns 0.0881 ns 0.0781 ns 61.3829 ns 61.2802 ns 61.5378 ns - - - -
BitwiseAndOperatorBenchmark 8.7401 ns 0.0907 ns 0.0804 ns 8.7529 ns 8.4814 ns 8.8201 ns - - - -
BitwiseOrOperatorBenchmark 5.8710 ns 0.0195 ns 0.0163 ns 5.8664 ns 5.8459 ns 5.8993 ns - - - -
DivisionOperatorBenchmark 243.7459 ns 6.7311 ns 7.7516 ns 244.9533 ns 227.9078 ns 257.7710 ns - - - -
EqualityOperatorBenchmark 0.9734 ns 0.0154 ns 0.0136 ns 0.9692 ns 0.9549 ns 1.0001 ns - - - -
ExclusiveOrOperatorBenchmark 5.9003 ns 0.0257 ns 0.0228 ns 5.8996 ns 5.8748 ns 5.9447 ns - - - -
InequalityOperatorBenchmark 3.5203 ns 0.0322 ns 0.0285 ns 3.5061 ns 3.4928 ns 3.5887 ns - - - -
MultiplyOperatorBenchmark 61.9186 ns 0.1388 ns 0.1159 ns 61.9120 ns 61.7196 ns 62.1257 ns - - - -
OnesComplementOperatorBenchmark 21.6788 ns 0.1246 ns 0.1105 ns 21.6766 ns 21.5047 ns 21.8764 ns - - - -
SubtractionOperatorBenchmark 60.1066 ns 0.2208 ns 0.1844 ns 60.1600 ns 59.8257 ns 60.3687 ns - - - -
UnaryNegateOperatorBenchmark 67.4316 ns 0.2463 ns 0.2183 ns 67.3973 ns 67.1599 ns 67.8942 ns - - - -
AbsBenchmark 10.3173 ns 0.1574 ns 0.1472 ns 10.2980 ns 10.0591 ns 10.5487 ns - - - -
AddBenchmark 65.3602 ns 0.1875 ns 0.1662 ns 65.3597 ns 65.0336 ns 65.6387 ns - - - -
AndNotBenchmark 22.8476 ns 0.0983 ns 0.0872 ns 22.8684 ns 22.6440 ns 22.9391 ns - - - -
BitwiseAndBenchmark 10.1515 ns 0.1625 ns 0.1520 ns 10.1560 ns 9.8568 ns 10.3946 ns - - - -
BitwiseOrBenchmark 8.7920 ns 0.0685 ns 0.0641 ns 8.7941 ns 8.6943 ns 8.9188 ns - - - -
ConditionalSelectBenchmark 35.1013 ns 0.0871 ns 0.0814 ns 35.0999 ns 34.9701 ns 35.2477 ns - - - -
DivideBenchmark 230.8056 ns 3.8200 ns 3.5733 ns 231.1309 ns 223.5282 ns 236.5047 ns - - - -
DotBenchmark 106.8848 ns 1.5138 ns 1.4160 ns 106.6555 ns 104.8663 ns 109.8214 ns - - - -
EqualsStaticBenchmark 38.7218 ns 0.1433 ns 0.1341 ns 38.6835 ns 38.5266 ns 38.9829 ns - - - -
EqualsAllBenchmark 3.5746 ns 0.0356 ns 0.0316 ns 3.5639 ns 3.5331 ns 3.6283 ns - - - -
EqualsAnyBenchmark 57.1994 ns 0.2407 ns 0.2134 ns 57.1643 ns 56.9201 ns 57.6293 ns - - - -
GreaterThanBenchmark 36.8960 ns 0.0887 ns 0.0786 ns 36.8852 ns 36.7171 ns 37.0227 ns - - - -
GreaterThanAllBenchmark 105.6515 ns 0.7164 ns 0.6351 ns 105.6380 ns 104.6500 ns 106.7682 ns - - - -
GreaterThanAnyBenchmark 94.5897 ns 1.4236 ns 1.3316 ns 94.5560 ns 92.3444 ns 97.3495 ns - - - -
GreaterThanOrEqualBenchmark 144.2161 ns 0.3195 ns 0.2832 ns 144.1130 ns 143.8592 ns 144.8957 ns - - - -
GreaterThanOrEqualAllBenchmark 225.7534 ns 0.7733 ns 0.6457 ns 225.8675 ns 224.5361 ns 226.7384 ns - - - -
GreaterThanOrEqualAnyBenchmark 214.1536 ns 1.0426 ns 0.9752 ns 213.6999 ns 213.0955 ns 216.1802 ns - - - -
LessThanBenchmark 34.9333 ns 0.1168 ns 0.1036 ns 34.9611 ns 34.7754 ns 35.1236 ns - - - -
LessThanAllBenchmark 162.8080 ns 1.7153 ns 1.6045 ns 162.2848 ns 160.3714 ns 165.8838 ns - - - -
LessThanAnyBenchmark 44.2883 ns 0.4552 ns 0.4258 ns 44.3683 ns 43.1221 ns 44.9381 ns - - - -
LessThanOrEqualBenchmark 183.8210 ns 0.4592 ns 0.4071 ns 183.8070 ns 183.3386 ns 184.5908 ns - - - -
LessThanOrEqualAllBenchmark 401.9115 ns 8.9488 ns 10.3055 ns 402.8547 ns 384.7896 ns 423.1362 ns - - - -
LessThanOrEqualAnyBenchmark 265.0211 ns 3.5067 ns 3.2801 ns 264.9297 ns 260.6478 ns 270.4116 ns - - - -
MaxBenchmark 40.3965 ns 0.1073 ns 0.1004 ns 40.4098 ns 40.1970 ns 40.5682 ns - - - -
MinBenchmark 39.0784 ns 0.1081 ns 0.0959 ns 39.0810 ns 38.9268 ns 39.2049 ns - - - -
MultiplyBenchmark 63.0806 ns 0.2723 ns 0.2547 ns 63.0654 ns 62.6993 ns 63.6056 ns - - - -
NegateBenchmark 63.1924 ns 0.1658 ns 0.1550 ns 63.2034 ns 62.9797 ns 63.5404 ns - - - -
OnesComplementBenchmark 22.2950 ns 0.1434 ns 0.1271 ns 22.3171 ns 21.9820 ns 22.4744 ns - - - -
SquareRootBenchmark 129.6186 ns 0.5511 ns 0.4885 ns 129.5628 ns 128.9000 ns 130.5883 ns - - - -
SubtractBenchmark 66.1447 ns 0.3725 ns 0.3484 ns 66.1251 ns 65.3131 ns 66.5934 ns - - - -
XorBenchmark 9.4515 ns 0.1190 ns 0.1113 ns 9.4744 ns 9.2758 ns 9.6036 ns - - - -

Thanks for collecting the data. I am supportive of your proposal to simplify the template.

I think that removing the T4 templates is the right thing to do, but it also seems that it would be worth what might be a smallish investment to reduce the perf impact for arm32.

but it also seems that it would be worth what might be a smallish investment to reduce the perf impact for arm32.

I'll take a look and see if adding the loop unrolling support is trivial. From what I recall on doing it for Vector128<T>.Count, it might just be treating Count as intrinsic and ensuring the right flag is set.

@tannergooding - that would be good, but perhaps that could also be done as a "cleanup" PR that includes the work to figure out why the HFA classification requires the Register type.

So we have the SIMDHandlesCache which is set based on matching the name of the generic type: https://github.com/dotnet/runtime/blob/master/src/coreclr/src/jit/simd.cpp#L325-L330.

However, we then map that later based on just the simdBaseType:

However, TYP_I_MPL and TYP_U_IMPL are themselves merely aliases for TYP_LONG/TYP_INT and TYP_ULONG/TYP_UINT, respectively; so the caching logic breaks down: https://github.com/dotnet/runtime/blob/master/src/coreclr/src/jit/vartype.h#L35-L50

Is there some way to get the handle for Vector<ulong>/Vector<uint> when we encounter a Vector<nuint> so the cache won't be invalid during later lookups or is there someway to track that a given TYP_ULONG is actually a TYP_U_IMPL (existing support that is)?

CC. @CarolEidt, @dotnet/jit-contrib

@jkotas, is there currently a JIT/EE method that can be used to resolve a handle? This would end up being "pay for play" as it would only be resolved once and only if a Vector<nint> or Vector<nuint> was the used handle.

I was also looking at finishing connecting the m_layout field as an alternative, but there are a number of places where we may synthesize nodes and ultimately create something which doesn't yet have a resolved/cached handle.

Does getTypeInstantiationArgument method on JIT/EE help?

I'm not really a fan of the SIMDHandlesCache, and I believe we'd be better off extending the ClassLayout to include the SIMD "base type" (instantiation type). Then we could replace the search through the SIMDHandlesCache with a lookup in the ClassLayoutTable.

there are a number of places where we may synthesize nodes and ultimately create something which doesn't yet have a resolved/cached handle

That seems like an issue we need to be able to handle in any event, and I wonder if a synthetic ClassLayout would be the way to address that. I confess I haven't investigated this in detail, but it doesn't seem like there's any other reasonable way to get class info if you don't have a handle (getTypeInstantiationArgument requires a handle).

Does getTypeInstantiationArgument method on JIT/EE help?

I don't think so. The issue is we are building the SIMDHandleCache by matching the name: https://github.com/dotnet/runtime/blob/master/src/coreclr/src/jit/simd.cpp#L325-L330
So what we have is a Vector`1[System.IntPtr], however we only track the simdBaseType in the created node, which will be TYP_I_IMPL, which is just an alias for TYP_LONG

This means that when we later try and get the handle back, such as in https://github.com/dotnet/runtime/blob/master/src/coreclr/src/jit/gentree.cpp#L17354-L17356, we will try to get a handle for Vector`1[System.Int64], which doesn't exist.

I was hoping there was some way to, for example, from a class handle for Vector`1[System.IntPtr], resolve the class handle for Vector`1[System.Int64]. This would largely solve some of the handle lookup problems we have both for this case and cases where we are returning a type that hasn't been resolved (say for example, we internally create a node that temporarily treats a Vector`1[System.Single] as a Vector`1[System.Int32] -- we do this or similar in a few places).

I'm not really a fan of the SIMDHandlesCache, and I believe we'd be better off extending the ClassLayout to include the SIMD "base type" (instantiation type). Then we could replace the search through the SIMDHandlesCache with a lookup in the ClassLayoutTable.
That seems like an issue we need to be able to handle in any event, and I wonder if a synthetic ClassLayout would be the way to address that.

I think this, in general, probably needs a bit of thought.

We also have a couple issues around rewriting intrinsics into user calls in lowering (such as to better handle operands that later become constants) in which case I think we want to carry a CORINFO_METHOD_HANDLE instead (or in addition to). Like GenTreeIntrinsic does: https://github.com/dotnet/runtime/blob/master/src/coreclr/src/jit/gentree.h#L4743-L4748 and with the rewriting: https://github.com/dotnet/runtime/blob/master/src/coreclr/src/jit/rationalize.cpp#L265-L284)

I believe you can get the ClassLayout for a given return type or parameter from the method handle, but I think we'd also have the issue where we may not have a method handle (such as for an intrinsic node created to handle a helper method, for example).

The issue is we are building the SIMDHandleCache by matching the name: https://github.com/dotnet/runtime/blob/master/src/coreclr/src/jit/simd.cpp#L325-L330

BTW: This name matching code is sub-optimal. It is using slow method for name formatting that it is meant to be used for debug-only tracing. The TODO at https://github.com/dotnet/runtime/blob/master/src/coreclr/src/jit/simd.cpp#L257 talks about it.

We can add JIT/EE interface methods that let you create new generic instantiations if it helps JIT to represent things. JIT would have to guarantee that these handles are never embedded into the code (directly or indirectly), e.g. the JITed code cannot call these instantiations. Otherwise, it would cause problems for AOT.

BTW: This name matching code is sub-optimal. It is using slow method for name formatting that it is meant to be used for debug-only tracing. The TODO at https://github.com/dotnet/runtime/blob/master/src/coreclr/src/jit/simd.cpp#L257 talks about it.

This is what getTypeInstantiationArgument is for, correct? It looks like we are actually using that for the Vector64/128/256<T> path: https://github.com/dotnet/runtime/blob/master/src/coreclr/src/jit/simd.cpp#L578...

Is it prohibitively expensive if we used it exclusively or would we still need some caching layer to help?

JIT would have to guarantee that these handles are never embedded into the code (directly or indirectly), e.g. the JITed code cannot call these instantiations

I think that's reasonable and don't believe we are using the handles for anything like that today.

What I'd like to do, ideally, is come to some middle ground between where we are today with the SIMDHandleCache and where we'd like to be with the ClassLayout info. I believe the latter is quite a bit more work, while the former would at least unblock this issue and related scenarios around accelerating Quaternion/Plane (which are functionally a Vector4 for most operations) and implementing the generic Vector2/3/4<T> types.

It sounds like we might be able to do something like:

  1. Fixup the Vector<T> cache to use getTypeInstantiationArgument, which resolves the name lookup but keeps the mismatch
  2. Track the CORINFO_TYPE_* rather than the TYP_ for the simdBaseType (it is trivial to go from the former to the latter)

This would allow us to track the proper base type and utilize Vector<T> for nint and nuint. We could then continue the discussions around how to handle ClassLayout and and if we can have some "synthetic" class layout like @CarolEidt suggested.

I got most of the changes done (it rounds out to +812, -722 lines, approx), but ran into a bit of "snag".

In particular JITtype2varType (https://github.com/dotnet/runtime/blob/master/src/coreclr/src/jit/ee_il_dll.hpp#L152) doesn't preserve the sign of certain types: https://github.com/dotnet/runtime/blob/master/src/coreclr/src/jit/ee_il_dll.hpp#L152
For example, while CORINFO_TYPE_* for BYTE, UBYTE, SHORT, and USHORT all translate over to the corresponding TYP_*. INT, UINT, LONG, ULONG, NATIVEINT, and NATIVEUINT all go to the signed variant (INT, LONG, I_IMPL).

This isn't hard to handle, but I was wondering if someone could explain the reason for the difference here? At first glance it looks like a bug, but I'm guessing there is some reason for it.
I don't see any surrounding comments, nor comments in https://github.com/dotnet/runtime/blob/master/src/coreclr/src/inc/corinfo.h or https://github.com/dotnet/runtime/blob/master/src/coreclr/src/jit/typelist.h

In particular JITtype2varType (https://github.com/dotnet/runtime/blob/master/src/coreclr/src/jit/ee_il_dll.hpp#L152) doesn't preserve the sign of certain types: https://github.com/dotnet/runtime/blob/master/src/coreclr/src/jit/ee_il_dll.hpp#L152
For example, while CORINFO_TYPE_* for BYTE, UBYTE, SHORT, and USHORT all translate over to the corresponding TYP_*. INT, UINT, LONG, ULONG, NATIVEINT, and NATIVEUINT all go to the signed variant (INT, LONG, I_IMPL).

Isn't this related to the fact that CLI models only 3 integral types on the computation stack - int32, native int and int64. Perhaps, you can distinguish between signed and unsigned variants by checking flag GTF_UNSIGNED?

Perhaps, you can distinguish between signed and unsigned variants by checking flag GTF_UNSIGNED?

The base type is distinct from the node type and while they often match up, there are many cases where they don't.

Isn't this related to the fact that CLI models only 3 integral types on the computation stack

That was my initial thought, but byte, ubyte, short, and ushort not mapping to int ends up not lining up with that and much of the JIT logic uses genActualType() to convert a var_types to the stack type, where that is relevant.

In particular JITtype2varType (https://github.com/dotnet/runtime/blob/master/src/coreclr/src/jit/ee_il_dll.hpp#L152) doesn't preserve the sign of certain types: https://github.com/dotnet/runtime/blob/master/src/coreclr/src/jit/ee_il_dll.hpp#L152
For example, while CORINFO_TYPE_* for BYTE, UBYTE, SHORT, and USHORT all translate over to the corresponding TYP_*. INT, UINT, LONG, ULONG, NATIVEINT, and NATIVEUINT all go to the signed variant (INT, LONG, I_IMPL).

I'm not sure why this is the case. It does seem quite inconsistent, but I would bet that there's JIT code somewhere that depends on it.