Consider expanding `Vector<T>` to support `nint` and `nuint`
tannergooding opened this issue · 30 comments
Rationale
Today Vector<T> supports the following 10 primitive types: byte, sbyte, short, ushort, int, uint, long, ulong, float, and double.
C# 9 is introducing support for nint and nuint, which are variable-sized integers matching the bitness of the underlying platform. That is, they are 32 bits wide on 32-bit systems and 64 bits wide on 64-bit systems.
As such, it may be beneficial to expand Vector<T> to additionally support these types so we can get rid of the using aliases and support performing the cross-platform vector operations using these new primitive types.
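For a quick illustration of the sizing behavior (a minimal snippet, assuming a C# 9 compiler with unsafe code enabled; the class name is made up):

using System;

class NativeIntDemo
{
    static unsafe void Main()
    {
        // nint and nuint track the platform pointer size.
        Console.WriteLine(sizeof(nint));  // 4 on 32-bit, 8 on 64-bit
        Console.WriteLine(sizeof(nuint)); // 4 on 32-bit, 8 on 64-bit
    }
}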
Proposal
Extend Vector<T> to support nint and nuint as valid primitive types. This will extend a number of existing generic functions which take a Vector<T> to also support the new types rather than throwing a PlatformNotSupportedException.
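As a sketch of what this enables in user code (illustrative only; today these instantiations throw PlatformNotSupportedException):

using System.Numerics;

class VectorNUIntSample
{
    static bool Demo()
    {
        // Existing generic helpers and operators simply start working for the
        // new element types instead of throwing.
        var left = new Vector<nuint>((nuint)1);
        var right = new Vector<nuint>((nuint)2);
        Vector<nuint> sum = left + right;
        return Vector.EqualsAll(sum, new Vector<nuint>((nuint)3));
    }
}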
Additionally, the following non-generic APIs should be added for parity with the existing surface area:
namespace System.Numerics
{
public partial struct Vector<T>
{
public static explicit operator Vector<nint>(Vector<T> value);
public static explicit operator Vector<nuint>(Vector<T> value);
}
public static partial class Vector
{
public static Vector<nint> AsVectorIntPtr<T>(Vector<T> value);
public static Vector<nuint> AsVectorUIntPtr<T>(Vector<T> value);
public static Vector<nint> Equals(Vector<nint> left, Vector<nint> right);
public static Vector<nuint> Equals(Vector<nuint> left, Vector<nuint> right);
public static Vector<nint> GreaterThan(Vector<nint> left, Vector<nint> right);
public static Vector<nuint> GreaterThan(Vector<nuint> left, Vector<nuint> right);
public static Vector<nint> GreaterThanOrEqual(Vector<nint> left, Vector<nint> right);
public static Vector<nuint> GreaterThanOrEqual(Vector<nuint> left, Vector<nuint> right);
public static Vector<nint> LessThan(Vector<nint> left, Vector<nint> right);
public static Vector<nuint> LessThan(Vector<nuint> left, Vector<nuint> right);
public static Vector<nint> LessThanOrEqual(Vector<nint> left, Vector<nint> right);
public static Vector<nuint> LessThanOrEqual(Vector<nuint> left, Vector<nuint> right);
}
}
Other Considerations
For API names, we have a guideline that states to use the framework name rather than the language keyword name (e.g. Int32 and Int64 rather than int and long). However, the framework name for nint is IntPtr, and the operators exposed and general use case for the two types are somewhat different; as such, names like NInt and NUInt may be a better alternative. Whatever we choose, it should likely become the standard for nint moving forward.
The same request could be made for System.Runtime.Intrinsics, but the API bloat for this would be much larger and would need further consideration. It might be worthwhile to allow nint/nuint as valid T without exposing the additional overloads initially, as that would at least unblock users from providing such APIs themselves.
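For example, if nint were a valid T for Vector128<T>, a user could bridge to the existing fixed-size overloads themselves (a hypothetical sketch; the helper name and the x86-only paths are illustrative):

using System;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

static class Vector128NIntBridge
{
    // Reinterpret the nint elements as the fixed-size type matching the
    // current bitness, then call the overloads that already exist.
    public static Vector128<nint> CompareEqual(Vector128<nint> left, Vector128<nint> right)
    {
        if (IntPtr.Size == 8)
        {
            // 64-bit process: nint elements are 64 bits wide (PCMPEQQ, SSE4.1).
            return Sse41.CompareEqual(left.As<nint, long>(), right.As<nint, long>()).As<long, nint>();
        }

        // 32-bit process: nint elements are 32 bits wide (PCMPEQD, SSE2).
        return Sse2.CompareEqual(left.As<nint, int>(), right.As<nint, int>()).As<int, nint>();
    }
}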
CC. @pgovind, @CarolEidt, @echesakovMSFT
We have an existing use case in the runtime in Utf16Utility.Validation.cs where we are using the following to work around not having this:
#if TARGET_64BIT
using nuint_t = System.UInt64;
#else
using nuint_t = System.UInt32;
#endif
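With the proposal, such code could use the keyword directly and drop the #if entirely (a small sketch; the helper name and mask are illustrative, not the actual Utf16Utility code):

using System.Numerics;

static class Utf16Sketch
{
    // No nuint_t alias needed; the unchecked cast truncates the mask
    // correctly on 32-bit, so the same source works for both bitnesses.
    static bool AnyByteHasHighBitSet(Vector<nuint> value)
    {
        var highBits = new Vector<nuint>(unchecked((nuint)0x8080_8080_8080_8080));
        return (value & highBits) != Vector<nuint>.Zero;
    }
}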
Could you also support reference types? This could be much more powerful (e.g. for collection lookup). Of course, we could not cast other typed vectors to a reference-type vector.
Could you also support reference types?
As I understand it, that would be a non-trivial ask, likely without much benefit. Many of the operations exposed aren't valid or don't make sense for reference types, and the JIT doesn't support tracking GC types in SIMD registers.
There are likely other considerations I'm not aware of as well and which would best be answered by someone like @jkotas or @Maoni0.
I wonder if we could optimistically handle fast collection lookup anyway, even without support for reference types in SIMD. Consider that you have a T[] (where T is a reference type), and you want to quickly look up the index of any given T within the array. Assume referential equality checks, not deep equality checks.
The algorithm would then be as follows:
- Pin the T instance that represents your search target, then project this T to a void*, then to nuint.
- Project the underlying T[] as a Span<nuint>.
- Perform a vectorized search within the Span<nuint> looking for the "target" nuint.
Since the target is pinned, it cannot be moved by the GC, so you don't have to worry about having the GC track individual elements within the SIMD registers. It's possible that the GC might move other elements of the array while you're inspecting it, but that's ok since neither their original addresses nor their modified addresses will match the pinned address of the target you're seeking. This means you're guaranteed zero false positives and zero false negatives.
Again, this would only work for referential equality checks. Since I'm not sure how common referential vs. deep equality checks are, I'm not sure how much benefit this would offer in practice.
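A rough sketch of that algorithm, assuming Vector<nuint> from this proposal and .NET 5's MemoryMarshal.GetArrayDataReference (note that GCHandleType.Pinned only succeeds for objects the runtime considers pinnable, so treat this as illustrative rather than production-ready):

using System;
using System.Numerics;
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;

static class ReferenceSearch
{
    public static int IndexOfReference<T>(T[] array, T target) where T : class
    {
        GCHandle handle = GCHandle.Alloc(target, GCHandleType.Pinned);
        try
        {
            // Read the (now immovable) object reference's bits as a nuint.
            nuint targetBits = Unsafe.As<T, nuint>(ref target);

            // View the array's object references as native-sized integers.
            ref nuint first = ref Unsafe.As<T, nuint>(ref MemoryMarshal.GetArrayDataReference(array));
            Span<nuint> bits = MemoryMarshal.CreateSpan(ref first, array.Length);

            var targetVector = new Vector<nuint>(targetBits);
            int i = 0;
            for (; i + Vector<nuint>.Count <= bits.Length; i += Vector<nuint>.Count)
            {
                var chunk = new Vector<nuint>(bits.Slice(i, Vector<nuint>.Count));
                if (Vector.EqualsAny(chunk, targetVector))
                    break; // the scalar tail below finds the exact index
            }

            for (; i < bits.Length; i++)
            {
                if (bits[i] == targetBits)
                    return i;
            }

            return -1;
        }
        finally
        {
            handle.Free();
        }
    }
}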
BTW, maybe we also want support for enum.
- Pin the T instance that represents your search target, then project this T to a void*, then to nuint.
Maybe we need some compiler-specific support or a JIT intrinsic to do this effectively. Such as:
IntPtr e0, e1, e2, e3; // fixed (pinned) locals acting as pointers
Unsafe.WriteUnaligned(ref Unsafe.As<IntPtr, byte>(ref e0), Unsafe.ReadUnaligned<Vector<nint>>(ref Unsafe.As<T, byte>(ref array[0])));
var vector = Unsafe.ReadUnaligned<Vector<nint>>(ref e0);
C# is going to have the record type feature, so the compiler can provide detailed, structured information about a record's equality. If Vector<object> is supported (or the workaround with nint works), we could compare deep equality via SIMD.
This is pseudo code.
[EquatabilityContract]
[CompilerGenerated]
class Record
{
    [EquatabilityMember]
    long id;
}

Record[] records;

Record Find(long id)
{
    for (var i = 0; i + Vector256<object>.Count < records.Length; i += Vector256<object>.Count)
    {
        var vector = Unsafe.ReadUnaligned<Vector256<object>>(ref Unsafe.As<Record, byte>(ref records[i]));
        // "tokenof" means a member definition reference in IL; the compiler could generate it.
        var ids = Avx2.GatherVector256<long>(JitHelper.OffsetOf<Record>(tokenof(id)), vector);
        var compare = Avx2.CompareEqual(ids, Vector256.Create(id));
        var mm = Avx2.MoveMask(compare.AsByte());
        if (mm != 0)
        {
            // First matching byte -> element index within the vector.
            return Unsafe.As<Record>(vector.GetElement(BitOperations.TrailingZeroCount(mm) / sizeof(long)));
        }
    }
    return null;
}
- We should use NInt and NUInt as the non-keyword type names for APIs that have to refer to nint and nuint, as opposed to IntPtr and UIntPtr.
namespace System.Numerics
{
public partial struct Vector<T>
{
public static explicit operator Vector<nint>(Vector<T> value);
public static explicit operator Vector<nuint>(Vector<T> value);
}
public static partial class Vector
{
public static Vector<nint> AsVectorNInt<T>(Vector<T> value);
public static Vector<nuint> AsVectorNUInt<T>(Vector<T> value);
public static Vector<nint> Equals(Vector<nint> left, Vector<nint> right);
public static Vector<nuint> Equals(Vector<nuint> left, Vector<nuint> right);
public static Vector<nint> GreaterThan(Vector<nint> left, Vector<nint> right);
public static Vector<nuint> GreaterThan(Vector<nuint> left, Vector<nuint> right);
public static Vector<nint> GreaterThanOrEqual(Vector<nint> left, Vector<nint> right);
public static Vector<nuint> GreaterThanOrEqual(Vector<nuint> left, Vector<nuint> right);
public static Vector<nint> LessThan(Vector<nint> left, Vector<nint> right);
public static Vector<nuint> LessThan(Vector<nuint> left, Vector<nuint> right);
public static Vector<nint> LessThanOrEqual(Vector<nint> left, Vector<nint> right);
public static Vector<nuint> LessThanOrEqual(Vector<nuint> left, Vector<nuint> right);
}
}
@jkotas, @CarolEidt: How important is maintaining perf of Vector<T> for platforms without acceleration (such as ARM32)?
Today, much of the code is generated via T4 templates and internally uses the Register struct, which is explicit layout with overlapping data.
However, given the existence of System.Runtime.CompilerServices.Unsafe, we could remove the T4 templates and rewrite the software fallback logic to be generic loops instead, which would greatly simplify the code and make it easier to insert the nint and nuint support, which doesn't work as nicely with the T4 setup. A sketch of this shape follows below.
- The T4 templates also rarely work with the current S.P.Corelib setup, so the benefit is basically non-existent today.
This would also remove duplication between the IsHardwareAccelerated and fallback paths, where the former uses for loops and the latter is manually unrolled.
- The IsHardwareAccelerated path is relying on the basic loop unrolling support that exists for Vector<T>.Count, which we could enable for ARM32 if perf is a concern.
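For concreteness, the rewritten fallback could look roughly like this (a sketch under the assumptions above, with hypothetical helper names, not the actual corelib change):

using System;
using System.Numerics;
using System.Runtime.CompilerServices;

static class VectorSoftwareFallback
{
    // Per-element addition over the vector, replacing the T4-generated,
    // manually unrolled variants with a single generic loop.
    public static Vector<T> Add<T>(Vector<T> left, Vector<T> right) where T : struct
    {
        Unsafe.SkipInit(out Vector<T> result);

        for (int i = 0; i < Vector<T>.Count; i++)
        {
            T value = ScalarAdd(GetElement(ref left, i), GetElement(ref right, i));
            Unsafe.Add(ref Unsafe.As<Vector<T>, T>(ref result), i) = value;
        }

        return result;
    }

    private static T GetElement<T>(ref Vector<T> vector, int index) where T : struct
        => Unsafe.Add(ref Unsafe.As<Vector<T>, T>(ref vector), index);

    // The typeof(T) checks are JIT-time constants for struct T, so each
    // instantiation compiles down to a single branch-free body.
    private static T ScalarAdd<T>(T left, T right) where T : struct
    {
        if (typeof(T) == typeof(int))
            return (T)(object)((int)(object)left + (int)(object)right);
        if (typeof(T) == typeof(long))
            return (T)(object)((long)(object)left + (long)(object)right);
        if (typeof(T) == typeof(nint))
            return (T)(object)((nint)(object)left + (nint)(object)right);
        // ... remaining element types elided ...
        throw new NotSupportedException();
    }
}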
Do you have any numbers for how much regression we would potentially see on ARM32?
It would impact Mono target platforms too. @marek-safar How much do we care about non-accelerated Vector<T> performance on Mono?
Do you have any numbers for how much regression we would potentially see on ARM32?
I don't have any currently, but I can try and get some.
I imagine some parts would be improved, especially in user code, due to less data being zeroed (it's currently a 16-byte, 66-field explicit-layout struct). But some parts, namely the methods in S.P.Corelib, would likely regress due to running a for loop rather than a manually unrolled loop (although several functions aren't unrolled today).
How much do we care about non-accelerated Vector performance on Mono?
@jkotas we don't care about performance for that config
Do you have any numbers for how much regression we would potentially see on ARM32?
@jkotas, it looks like the worst case is about a 4x* perf regression (for byte, which has 16 elements). This only looks to be the case on ARM32, where the loop unrolling support around Vector<T>.Count doesn't exist.
- Some, like LessThanOrEqualAll, are actually closer to 6x, but this is because it does LessThan || Equals rather than doing an efficient single pass (see the sketch below).
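The difference between the two shapes looks roughly like this (illustrative, not the actual corelib source):

using System.Numerics;

static class LessThanOrEqualShapes
{
    // Today's shape: two full element-wise passes (LessThan, then Equals),
    // OR'd together, followed by an all-bits-set check.
    static bool TwoPass(Vector<int> left, Vector<int> right)
    {
        Vector<int> mask = Vector.LessThan(left, right) | Vector.Equals(left, right);
        return Vector.EqualsAll(mask, Vector.Equals(left, left)); // all-ones vector
    }

    // An efficient single-pass shape: one comparison per element, early exit.
    static bool SinglePass(Vector<int> left, Vector<int> right)
    {
        for (int i = 0; i < Vector<int>.Count; i++)
        {
            if (left[i] > right[i])
            {
                return false;
            }
        }
        return true;
    }
}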
BenchmarkDotNet=v0.12.1.1405-nightly, OS=Windows 10.0.19041.508 (2004/May2020Update/20H1)
Microsoft SQ1 3.0 GHz, 1 CPU, 8 logical and 8 physical cores
Current:
Method | Mean | Error | StdDev | Median | Min | Max | Gen 0 | Gen 1 | Gen 2 | Allocated |
---|---|---|---|---|---|---|---|---|---|---|
CountBenchmark | 0.0000 ns | 0.0000 ns | 0.0000 ns | 0.0000 ns | 0.0000 ns | 0.0000 ns | - | - | - | - |
OneBenchmark | 11.6882 ns | 0.0298 ns | 0.0264 ns | 11.6883 ns | 11.6550 ns | 11.7447 ns | - | - | - | - |
ZeroBenchmark | 0.0000 ns | 0.0000 ns | 0.0000 ns | 0.0000 ns | 0.0000 ns | 0.0000 ns | - | - | - | - |
EqualsBenchmark | 3.7958 ns | 0.0291 ns | 0.0258 ns | 3.8016 ns | 3.7472 ns | 3.8370 ns | - | - | - | - |
GetHashCodeBenchmark | 99.3895 ns | 1.4870 ns | 1.2417 ns | 99.2845 ns | 97.4959 ns | 101.7576 ns | - | - | - | - |
AddOperatorBenchmark | 14.7578 ns | 0.0843 ns | 0.0748 ns | 14.7397 ns | 14.6672 ns | 14.9465 ns | - | - | - | - |
BitwiseAndOperatorBenchmark | 12.2149 ns | 0.0902 ns | 0.0800 ns | 12.2393 ns | 12.0871 ns | 12.3594 ns | - | - | - | - |
BitwiseOrOperatorBenchmark | 9.1529 ns | 0.0340 ns | 0.0302 ns | 9.1471 ns | 9.1069 ns | 9.2057 ns | - | - | - | - |
DivisionOperatorBenchmark | 79.2174 ns | 0.0941 ns | 0.0786 ns | 79.1944 ns | 79.1173 ns | 79.3931 ns | - | - | - | - |
EqualityOperatorBenchmark | 0.7216 ns | 0.0280 ns | 0.0248 ns | 0.7208 ns | 0.6867 ns | 0.7715 ns | - | - | - | - |
ExclusiveOrOperatorBenchmark | 9.1752 ns | 0.0468 ns | 0.0438 ns | 9.1527 ns | 9.1113 ns | 9.2425 ns | - | - | - | - |
InequalityOperatorBenchmark | 3.9465 ns | 0.0683 ns | 0.0605 ns | 3.9228 ns | 3.8851 ns | 4.0766 ns | - | - | - | - |
MultiplyOperatorBenchmark | 15.1731 ns | 0.0715 ns | 0.0633 ns | 15.1714 ns | 15.0914 ns | 15.3134 ns | - | - | - | - |
OnesComplementOperatorBenchmark | 21.1975 ns | 0.0804 ns | 0.0671 ns | 21.2021 ns | 21.0049 ns | 21.2608 ns | - | - | - | - |
SubtractionOperatorBenchmark | 15.2344 ns | 0.0691 ns | 0.0613 ns | 15.2315 ns | 15.1622 ns | 15.3523 ns | - | - | - | - |
UnaryNegateOperatorBenchmark | 17.3482 ns | 0.0274 ns | 0.0243 ns | 17.3372 ns | 17.3272 ns | 17.4067 ns | - | - | - | - |
AbsBenchmark | 4.8659 ns | 0.0293 ns | 0.0259 ns | 4.8537 ns | 4.8366 ns | 4.9118 ns | - | - | - | - |
AddBenchmark | 18.1764 ns | 0.0569 ns | 0.0504 ns | 18.1810 ns | 18.1020 ns | 18.2635 ns | - | - | - | - |
AndNotBenchmark | 29.3541 ns | 0.1666 ns | 0.1558 ns | 29.3400 ns | 29.1411 ns | 29.6693 ns | - | - | - | - |
BitwiseAndBenchmark | 10.0123 ns | 0.0400 ns | 0.0355 ns | 10.0063 ns | 9.9612 ns | 10.0709 ns | - | - | - | - |
BitwiseOrBenchmark | 10.0021 ns | 0.1060 ns | 0.0991 ns | 9.9950 ns | 9.8593 ns | 10.1830 ns | - | - | - | - |
ConditionalSelectBenchmark | 49.0080 ns | 0.3078 ns | 0.2879 ns | 49.0135 ns | 48.4501 ns | 49.5042 ns | - | - | - | - |
DivideBenchmark | 78.8175 ns | 0.2677 ns | 0.2373 ns | 78.7779 ns | 78.4765 ns | 79.2487 ns | - | - | - | - |
DotBenchmark | 15.2722 ns | 0.0261 ns | 0.0231 ns | 15.2709 ns | 15.2363 ns | 15.3179 ns | - | - | - | - |
EqualsStaticBenchmark | 22.6560 ns | 0.1268 ns | 0.1186 ns | 22.6159 ns | 22.5144 ns | 22.9051 ns | - | - | - | - |
EqualsAllBenchmark | 4.1479 ns | 0.0527 ns | 0.0467 ns | 4.1447 ns | 4.0745 ns | 4.2333 ns | - | - | - | - |
EqualsAnyBenchmark | 29.4765 ns | 0.0926 ns | 0.0866 ns | 29.4362 ns | 29.3856 ns | 29.6265 ns | - | - | - | - |
GreaterThanBenchmark | 21.7368 ns | 0.0631 ns | 0.0559 ns | 21.7294 ns | 21.6671 ns | 21.8384 ns | - | - | - | - |
GreaterThanAllBenchmark | 32.9013 ns | 0.0527 ns | 0.0493 ns | 32.9126 ns | 32.8253 ns | 32.9649 ns | - | - | - | - |
GreaterThanAnyBenchmark | 28.0047 ns | 0.0764 ns | 0.0714 ns | 27.9833 ns | 27.9141 ns | 28.1723 ns | - | - | - | - |
GreaterThanOrEqualBenchmark | 52.3921 ns | 0.3169 ns | 0.2646 ns | 52.2974 ns | 52.1671 ns | 53.0578 ns | - | - | - | - |
GreaterThanOrEqualAllBenchmark | 60.9816 ns | 0.1275 ns | 0.1065 ns | 60.9825 ns | 60.7853 ns | 61.2152 ns | - | - | - | - |
GreaterThanOrEqualAnyBenchmark | 58.1081 ns | 0.1660 ns | 0.1386 ns | 58.0799 ns | 57.8858 ns | 58.3585 ns | - | - | - | - |
LessThanBenchmark | 20.2578 ns | 0.1129 ns | 0.1000 ns | 20.2367 ns | 20.0859 ns | 20.4389 ns | - | - | - | - |
LessThanAllBenchmark | 33.3310 ns | 0.1187 ns | 0.0991 ns | 33.3558 ns | 33.1460 ns | 33.5284 ns | - | - | - | - |
LessThanAnyBenchmark | 25.5108 ns | 0.0980 ns | 0.0869 ns | 25.5065 ns | 25.3958 ns | 25.6639 ns | - | - | - | - |
LessThanOrEqualBenchmark | 49.8892 ns | 0.1327 ns | 0.1176 ns | 49.8391 ns | 49.7525 ns | 50.1483 ns | - | - | - | - |
LessThanOrEqualAllBenchmark | 60.0621 ns | 0.1442 ns | 0.1349 ns | 60.0328 ns | 59.8471 ns | 60.2880 ns | - | - | - | - |
LessThanOrEqualAnyBenchmark | 55.2470 ns | 0.0496 ns | 0.0388 ns | 55.2529 ns | 55.1691 ns | 55.3128 ns | - | - | - | - |
MaxBenchmark | 21.8661 ns | 0.0363 ns | 0.0322 ns | 21.8606 ns | 21.8211 ns | 21.9399 ns | - | - | - | - |
MinBenchmark | 21.0789 ns | 0.0779 ns | 0.0691 ns | 21.0662 ns | 20.9944 ns | 21.2350 ns | - | - | - | - |
MultiplyBenchmark | 18.0297 ns | 0.0527 ns | 0.0467 ns | 18.0289 ns | 17.9634 ns | 18.1398 ns | - | - | - | - |
NegateBenchmark | 18.5310 ns | 0.0626 ns | 0.0555 ns | 18.5090 ns | 18.4585 ns | 18.6348 ns | - | - | - | - |
OnesComplementBenchmark | 20.7130 ns | 0.1875 ns | 0.1754 ns | 20.7374 ns | 20.3737 ns | 20.9449 ns | - | - | - | - |
SquareRootBenchmark | 34.9111 ns | 0.0887 ns | 0.0786 ns | 34.8872 ns | 34.8279 ns | 35.0818 ns | - | - | - | - |
SubtractBenchmark | 18.0638 ns | 0.0379 ns | 0.0336 ns | 18.0603 ns | 17.9950 ns | 18.1135 ns | - | - | - | - |
XorBenchmark | 10.2635 ns | 0.1394 ns | 0.1304 ns | 10.2707 ns | 10.0667 ns | 10.5332 ns | - | - | - | - |
Generic Unsafe implementation:
// * Summary *
BenchmarkDotNet=v0.12.1.1405-nightly, OS=Windows 10.0.19041.508 (2004/May2020Update/20H1)
Microsoft SQ1 3.0 GHz, 1 CPU, 8 logical and 8 physical cores
.NET Core SDK=6.0.100-alpha.1.20459.9
[Host] : .NET Core 6.0.0 (CoreCLR 6.0.20.45424, CoreFX 6.0.20.45424), Arm RyuJIT
Job-XVEAND : .NET Core 5.0 (CoreCLR 42.42.42.42424, CoreFX 42.42.42.42424), Arm RyuJIT
PowerPlanMode=00000000-0000-0000-0000-000000000000 Arguments=/p:DebugType=portable Toolchain=CoreRun
IterationTime=250.0000 ms MaxIterationCount=20 MinIterationCount=15
WarmupCount=1
Method | Mean | Error | StdDev | Median | Min | Max | Gen 0 | Gen 1 | Gen 2 | Allocated |
---|---|---|---|---|---|---|---|---|---|---|
CountBenchmark | 0.6085 ns | 0.0292 ns | 0.0259 ns | 0.6163 ns | 0.5516 ns | 0.6352 ns | - | - | - | - |
OneBenchmark | 15.0520 ns | 0.0734 ns | 0.0687 ns | 15.0311 ns | 14.9681 ns | 15.1906 ns | - | - | - | - |
ZeroBenchmark | 0.7316 ns | 0.1004 ns | 0.1156 ns | 0.7163 ns | 0.5323 ns | 0.9691 ns | - | - | - | - |
EqualsBenchmark | 4.0958 ns | 0.0461 ns | 0.0408 ns | 4.0859 ns | 4.0474 ns | 4.1932 ns | - | - | - | - |
GetHashCodeBenchmark | 308.9952 ns | 2.5950 ns | 2.3004 ns | 308.3442 ns | 306.2050 ns | 314.2366 ns | - | - | - | - |
AddOperatorBenchmark | 61.4049 ns | 0.0881 ns | 0.0781 ns | 61.3829 ns | 61.2802 ns | 61.5378 ns | - | - | - | - |
BitwiseAndOperatorBenchmark | 8.7401 ns | 0.0907 ns | 0.0804 ns | 8.7529 ns | 8.4814 ns | 8.8201 ns | - | - | - | - |
BitwiseOrOperatorBenchmark | 5.8710 ns | 0.0195 ns | 0.0163 ns | 5.8664 ns | 5.8459 ns | 5.8993 ns | - | - | - | - |
DivisionOperatorBenchmark | 243.7459 ns | 6.7311 ns | 7.7516 ns | 244.9533 ns | 227.9078 ns | 257.7710 ns | - | - | - | - |
EqualityOperatorBenchmark | 0.9734 ns | 0.0154 ns | 0.0136 ns | 0.9692 ns | 0.9549 ns | 1.0001 ns | - | - | - | - |
ExclusiveOrOperatorBenchmark | 5.9003 ns | 0.0257 ns | 0.0228 ns | 5.8996 ns | 5.8748 ns | 5.9447 ns | - | - | - | - |
InequalityOperatorBenchmark | 3.5203 ns | 0.0322 ns | 0.0285 ns | 3.5061 ns | 3.4928 ns | 3.5887 ns | - | - | - | - |
MultiplyOperatorBenchmark | 61.9186 ns | 0.1388 ns | 0.1159 ns | 61.9120 ns | 61.7196 ns | 62.1257 ns | - | - | - | - |
OnesComplementOperatorBenchmark | 21.6788 ns | 0.1246 ns | 0.1105 ns | 21.6766 ns | 21.5047 ns | 21.8764 ns | - | - | - | - |
SubtractionOperatorBenchmark | 60.1066 ns | 0.2208 ns | 0.1844 ns | 60.1600 ns | 59.8257 ns | 60.3687 ns | - | - | - | - |
UnaryNegateOperatorBenchmark | 67.4316 ns | 0.2463 ns | 0.2183 ns | 67.3973 ns | 67.1599 ns | 67.8942 ns | - | - | - | - |
AbsBenchmark | 10.3173 ns | 0.1574 ns | 0.1472 ns | 10.2980 ns | 10.0591 ns | 10.5487 ns | - | - | - | - |
AddBenchmark | 65.3602 ns | 0.1875 ns | 0.1662 ns | 65.3597 ns | 65.0336 ns | 65.6387 ns | - | - | - | - |
AndNotBenchmark | 22.8476 ns | 0.0983 ns | 0.0872 ns | 22.8684 ns | 22.6440 ns | 22.9391 ns | - | - | - | - |
BitwiseAndBenchmark | 10.1515 ns | 0.1625 ns | 0.1520 ns | 10.1560 ns | 9.8568 ns | 10.3946 ns | - | - | - | - |
BitwiseOrBenchmark | 8.7920 ns | 0.0685 ns | 0.0641 ns | 8.7941 ns | 8.6943 ns | 8.9188 ns | - | - | - | - |
ConditionalSelectBenchmark | 35.1013 ns | 0.0871 ns | 0.0814 ns | 35.0999 ns | 34.9701 ns | 35.2477 ns | - | - | - | - |
DivideBenchmark | 230.8056 ns | 3.8200 ns | 3.5733 ns | 231.1309 ns | 223.5282 ns | 236.5047 ns | - | - | - | - |
DotBenchmark | 106.8848 ns | 1.5138 ns | 1.4160 ns | 106.6555 ns | 104.8663 ns | 109.8214 ns | - | - | - | - |
EqualsStaticBenchmark | 38.7218 ns | 0.1433 ns | 0.1341 ns | 38.6835 ns | 38.5266 ns | 38.9829 ns | - | - | - | - |
EqualsAllBenchmark | 3.5746 ns | 0.0356 ns | 0.0316 ns | 3.5639 ns | 3.5331 ns | 3.6283 ns | - | - | - | - |
EqualsAnyBenchmark | 57.1994 ns | 0.2407 ns | 0.2134 ns | 57.1643 ns | 56.9201 ns | 57.6293 ns | - | - | - | - |
GreaterThanBenchmark | 36.8960 ns | 0.0887 ns | 0.0786 ns | 36.8852 ns | 36.7171 ns | 37.0227 ns | - | - | - | - |
GreaterThanAllBenchmark | 105.6515 ns | 0.7164 ns | 0.6351 ns | 105.6380 ns | 104.6500 ns | 106.7682 ns | - | - | - | - |
GreaterThanAnyBenchmark | 94.5897 ns | 1.4236 ns | 1.3316 ns | 94.5560 ns | 92.3444 ns | 97.3495 ns | - | - | - | - |
GreaterThanOrEqualBenchmark | 144.2161 ns | 0.3195 ns | 0.2832 ns | 144.1130 ns | 143.8592 ns | 144.8957 ns | - | - | - | - |
GreaterThanOrEqualAllBenchmark | 225.7534 ns | 0.7733 ns | 0.6457 ns | 225.8675 ns | 224.5361 ns | 226.7384 ns | - | - | - | - |
GreaterThanOrEqualAnyBenchmark | 214.1536 ns | 1.0426 ns | 0.9752 ns | 213.6999 ns | 213.0955 ns | 216.1802 ns | - | - | - | - |
LessThanBenchmark | 34.9333 ns | 0.1168 ns | 0.1036 ns | 34.9611 ns | 34.7754 ns | 35.1236 ns | - | - | - | - |
LessThanAllBenchmark | 162.8080 ns | 1.7153 ns | 1.6045 ns | 162.2848 ns | 160.3714 ns | 165.8838 ns | - | - | - | - |
LessThanAnyBenchmark | 44.2883 ns | 0.4552 ns | 0.4258 ns | 44.3683 ns | 43.1221 ns | 44.9381 ns | - | - | - | - |
LessThanOrEqualBenchmark | 183.8210 ns | 0.4592 ns | 0.4071 ns | 183.8070 ns | 183.3386 ns | 184.5908 ns | - | - | - | - |
LessThanOrEqualAllBenchmark | 401.9115 ns | 8.9488 ns | 10.3055 ns | 402.8547 ns | 384.7896 ns | 423.1362 ns | - | - | - | - |
LessThanOrEqualAnyBenchmark | 265.0211 ns | 3.5067 ns | 3.2801 ns | 264.9297 ns | 260.6478 ns | 270.4116 ns | - | - | - | - |
MaxBenchmark | 40.3965 ns | 0.1073 ns | 0.1004 ns | 40.4098 ns | 40.1970 ns | 40.5682 ns | - | - | - | - |
MinBenchmark | 39.0784 ns | 0.1081 ns | 0.0959 ns | 39.0810 ns | 38.9268 ns | 39.2049 ns | - | - | - | - |
MultiplyBenchmark | 63.0806 ns | 0.2723 ns | 0.2547 ns | 63.0654 ns | 62.6993 ns | 63.6056 ns | - | - | - | - |
NegateBenchmark | 63.1924 ns | 0.1658 ns | 0.1550 ns | 63.2034 ns | 62.9797 ns | 63.5404 ns | - | - | - | - |
OnesComplementBenchmark | 22.2950 ns | 0.1434 ns | 0.1271 ns | 22.3171 ns | 21.9820 ns | 22.4744 ns | - | - | - | - |
SquareRootBenchmark | 129.6186 ns | 0.5511 ns | 0.4885 ns | 129.5628 ns | 128.9000 ns | 130.5883 ns | - | - | - | - |
SubtractBenchmark | 66.1447 ns | 0.3725 ns | 0.3484 ns | 66.1251 ns | 65.3131 ns | 66.5934 ns | - | - | - | - |
XorBenchmark | 9.4515 ns | 0.1190 ns | 0.1113 ns | 9.4744 ns | 9.2758 ns | 9.6036 ns | - | - | - | - |
Thanks for collecting the data. I am supportive of your proposal to simplify the template.
I think that removing the T4 templates is the right thing to do, but it also seems that it would be worth what might be a smallish investment to reduce the perf impact for arm32.
but it also seems that it would be worth what might be a smallish investment to reduce the perf impact for arm32.
I'll take a look and see if adding the loop unrolling support is trivial. From what I recall of doing it for Vector128<T>.Count, it might just be treating Count as intrinsic and ensuring the right flag is set.
@tannergooding - that would be good, but perhaps that could also be done as a "cleanup" PR that includes the work to figure out why the HFA classification requires the Register type.
So we have the SIMDHandlesCache, which is populated based on matching the name of the generic type: https://github.com/dotnet/runtime/blob/master/src/coreclr/src/jit/simd.cpp#L325-L330.
However, we then map that later based on just the simdBaseType:
- https://github.com/dotnet/runtime/blob/master/src/coreclr/src/jit/gentree.cpp#L17354-L17356
- https://github.com/dotnet/runtime/blob/master/src/coreclr/src/jit/compiler.h#L8001-L8002
- etc
However, TYP_I_IMPL and TYP_U_IMPL are themselves merely aliases for TYP_LONG/TYP_INT and TYP_ULONG/TYP_UINT, respectively, so the caching logic breaks down: https://github.com/dotnet/runtime/blob/master/src/coreclr/src/jit/vartype.h#L35-L50
Is there some way to get the handle for Vector<ulong>/Vector<uint> when we encounter a Vector<nuint> so the cache won't be invalid during later lookups, or is there some way to track that a given TYP_ULONG is actually a TYP_U_IMPL (with existing support, that is)?
CC. @CarolEidt, @dotnet/jit-contrib
@jkotas, is there currently a JIT/EE method that can be used to resolve a handle? This would end up being "pay for play" as it would only be resolved once, and only if a Vector<nint> or Vector<nuint> was the used handle.
I was also looking at finishing connecting the m_layout field as an alternative, but there are a number of places where we may synthesize nodes and ultimately create something which doesn't yet have a resolved/cached handle.
Does the getTypeInstantiationArgument method on JIT/EE help?
I'm not really a fan of the SIMDHandlesCache, and I believe we'd be better off extending the ClassLayout to include the SIMD "base type" (instantiation type). Then we could replace the search through the SIMDHandlesCache with a lookup in the ClassLayoutTable.
there are a number of places where we may synthesize nodes and ultimately create something which doesn't yet have a resolved/cached handle
That seems like an issue we need to be able to handle in any event, and I wonder if a synthetic ClassLayout would be the way to address that. I confess I haven't investigated this in detail, but it doesn't seem like there's any other reasonable way to get class info if you don't have a handle (getTypeInstantiationArgument requires a handle).
Does the getTypeInstantiationArgument method on JIT/EE help?
I don't think so. The issue is we are building the SIMDHandleCache by matching the name: https://github.com/dotnet/runtime/blob/master/src/coreclr/src/jit/simd.cpp#L325-L330
So what we have is a Vector`1[System.IntPtr]; however, we only track the simdBaseType in the created node, which will be TYP_I_IMPL, which is just an alias for TYP_LONG.
This means that when we later try and get the handle back, such as in https://github.com/dotnet/runtime/blob/master/src/coreclr/src/jit/gentree.cpp#L17354-L17356, we will try to get a handle for Vector`1[System.Int64], which doesn't exist.
I was hoping there was some way to, for example, resolve the class handle for Vector`1[System.Int64] from a class handle for Vector`1[System.IntPtr]. This would largely solve some of the handle lookup problems we have, both for this case and for cases where we are returning a type that hasn't been resolved (say, for example, we internally create a node that temporarily treats a Vector`1[System.Single] as a Vector`1[System.Int32] -- we do this or similar in a few places).
I'm not really a fan of the SIMDHandlesCache, and I believe we'd be better off extending the ClassLayout to include the SIMD "base type" (instantiation type). Then we could replace the search through the SIMDHandlesCache with a lookup in the ClassLayoutTable.
That seems like an issue we need to be able to handle in any event, and I wonder if a synthetic ClassLayout would be the way to address that.
I think this, in general, probably needs a bit of thought.
We also have a couple of issues around rewriting intrinsics into user calls in lowering (such as to better handle operands that later become constants), in which case I think we want to carry a CORINFO_METHOD_HANDLE instead (or in addition), like GenTreeIntrinsic does: https://github.com/dotnet/runtime/blob/master/src/coreclr/src/jit/gentree.h#L4743-L4748, and with the rewriting: https://github.com/dotnet/runtime/blob/master/src/coreclr/src/jit/rationalize.cpp#L265-L284
I believe you can get the ClassLayout for a given return type or parameter from the method handle, but I think we'd also have the issue where we may not have a method handle (such as for an intrinsic node created to handle a helper method, for example).
The issue is we are building the SIMDHandleCache by matching the name: https://github.com/dotnet/runtime/blob/master/src/coreclr/src/jit/simd.cpp#L325-L330
BTW: This name matching code is sub-optimal. It is using a slow method for name formatting that is meant to be used for debug-only tracing. The TODO at https://github.com/dotnet/runtime/blob/master/src/coreclr/src/jit/simd.cpp#L257 talks about it.
We can add JIT/EE interface methods that let you create new generic instantiations if it helps JIT to represent things. JIT would have to guarantee that these handles are never embedded into the code (directly or indirectly), e.g. the JITed code cannot call these instantiations. Otherwise, it would cause problems for AOT.
BTW: This name matching code is sub-optimal. It is using a slow method for name formatting that is meant to be used for debug-only tracing. The TODO at https://github.com/dotnet/runtime/blob/master/src/coreclr/src/jit/simd.cpp#L257 talks about it.
This is what getTypeInstantiationArgument is for, correct? It looks like we are actually using that for the Vector64/128/256<T> path: https://github.com/dotnet/runtime/blob/master/src/coreclr/src/jit/simd.cpp#L578...
Is it prohibitively expensive if we used it exclusively, or would we still need some caching layer to help?
JIT would have to guarantee that these handles are never embedded into the code (directly or indirectly), e.g. the JITed code cannot call these instantiations
I think that's reasonable and don't believe we are using the handles for anything like that today.
What I'd like to do, ideally, is come to some middle ground between where we are today with the SIMDHandleCache and where we'd like to be with the ClassLayout info. I believe the latter is quite a bit more work, while the former would at least unblock this issue and related scenarios around accelerating Quaternion/Plane (which are functionally a Vector4 for most operations) and implementing the generic Vector2/3/4<T> types.
It sounds like we might be able to do something like:
- Fix up the Vector<T> cache to use getTypeInstantiationArgument, which resolves the name lookup but keeps the mismatch
- Track the CORINFO_TYPE_* rather than the TYP_* for the simdBaseType (it is trivial to go from the former to the latter)
This would allow us to track the proper base type and utilize Vector<T> for nint and nuint. We could then continue the discussions around how to handle ClassLayout and whether we can have some "synthetic" class layout like @CarolEidt suggested.
I got most of the changes done (it rounds out to approximately +812/-722 lines), but ran into a bit of a snag.
In particular, JITtype2varType (https://github.com/dotnet/runtime/blob/master/src/coreclr/src/jit/ee_il_dll.hpp#L152) doesn't preserve the sign of certain types.
For example, while the CORINFO_TYPE_* values for BYTE, UBYTE, SHORT, and USHORT all translate over to the corresponding TYP_*, INT, UINT, LONG, ULONG, NATIVEINT, and NATIVEUINT all go to the signed variants (INT, LONG, I_IMPL).
This isn't hard to handle, but I was wondering if someone could explain the reason for the difference here? At first glance it looks like a bug, but I'm guessing there is some reason for it.
I don't see any surrounding comments, nor comments in https://github.com/dotnet/runtime/blob/master/src/coreclr/src/inc/corinfo.h or https://github.com/dotnet/runtime/blob/master/src/coreclr/src/jit/typelist.h
In particular, JITtype2varType (https://github.com/dotnet/runtime/blob/master/src/coreclr/src/jit/ee_il_dll.hpp#L152) doesn't preserve the sign of certain types. For example, while the CORINFO_TYPE_* values for BYTE, UBYTE, SHORT, and USHORT all translate over to the corresponding TYP_*, INT, UINT, LONG, ULONG, NATIVEINT, and NATIVEUINT all go to the signed variants (INT, LONG, I_IMPL).
Isn't this related to the fact that the CLI models only 3 integral types on the computation stack: int32, native int, and int64? Perhaps you can distinguish between the signed and unsigned variants by checking the flag GTF_UNSIGNED?
Perhaps you can distinguish between the signed and unsigned variants by checking the flag GTF_UNSIGNED?
The base type is distinct from the node type, and while they often match up, there are many cases where they don't.
Isn't this related to the fact that the CLI models only 3 integral types on the computation stack
That was my initial thought, but byte, ubyte, short, and ushort not mapping to int ends up not lining up with that, and much of the JIT logic uses genActualType() to convert a var_types to the stack type where that is relevant.
In particular, JITtype2varType (https://github.com/dotnet/runtime/blob/master/src/coreclr/src/jit/ee_il_dll.hpp#L152) doesn't preserve the sign of certain types.
For example, while the CORINFO_TYPE_* values for BYTE, UBYTE, SHORT, and USHORT all translate over to the corresponding TYP_*, INT, UINT, LONG, ULONG, NATIVEINT, and NATIVEUINT all go to the signed variants (INT, LONG, I_IMPL).
I'm not sure why this is the case. It does seem quite inconsistent, but I would bet that there's JIT code somewhere that depends on it.