Consider expanding `Vector<T>` to support `nint` and `nuint`
tannergooding opened this issue · 30 comments
Rationale
Today Vector<T> supports the following 10 primitive types: byte, sbyte, short, ushort, int, uint, long, ulong, float, and double.
C# 9 is introducing support for nint and nuint, which are variable-sized integers matching the bitness of the underlying platform. That is, they are 32 bits wide on 32-bit systems and 64 bits wide on 64-bit systems.
As such, it may be beneficial to expand Vector<T> to additionally support these types so we can get rid of the using aliases and support performing the cross-platform vector operations using these new primitive types.
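For a quick illustration of the sizing behavior (a minimal snippet, assuming a C# 9 compiler with unsafe code enabled; the class name is made up):

using System;

class NativeIntDemo
{
    static unsafe void Main()
    {
        // nint and nuint track the platform pointer size.
        Console.WriteLine(sizeof(nint));  // 4 on 32-bit, 8 on 64-bit
        Console.WriteLine(sizeof(nuint)); // 4 on 32-bit, 8 on 64-bit
    }
}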
Proposal
Extend Vector<T> to support nint and nuint as valid primitive types. This will extend a number of existing generic functions which take a Vector<T> to also support the new types rather than throwing a PlatformNotSupportedException.
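As a sketch of what this enables in user code (illustrative only; today these instantiations throw PlatformNotSupportedException):

using System.Numerics;

class VectorNUIntSample
{
    static bool Demo()
    {
        // Existing generic helpers and operators simply start working for the
        // new element types instead of throwing.
        var left = new Vector<nuint>((nuint)1);
        var right = new Vector<nuint>((nuint)2);
        Vector<nuint> sum = left + right;
        return Vector.EqualsAll(sum, new Vector<nuint>((nuint)3));
    }
}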
Additionally, the following non-generic APIs should be added for parity with the existing surface area:
namespace System.Numerics
{
public partial struct Vector<T>
{
public static explicit operator Vector<nint>(Vector<T> value);
public static explicit operator Vector<nuint>(Vector<T> value);
}
public static partial class Vector
{
public static Vector<nint> AsVectorIntPtr<T>(Vector<T> value);
public static Vector<nuint> AsVectorUIntPtr<T>(Vector<T> value);
public static Vector<nint> Equals(Vector<nint> left, Vector<nint> right);
public static Vector<nuint> Equals(Vector<nuint> left, Vector<nuint> right);
public static Vector<nint> GreaterThan(Vector<nint> left, Vector<nint> right);
public static Vector<nuint> GreaterThan(Vector<nuint> left, Vector<nuint> right);
public static Vector<nint> GreaterThanOrEqual(Vector<nint> left, Vector<nint> right);
public static Vector<nuint> GreaterThanOrEqual(Vector<nuint> left, Vector<nuint> right);
public static Vector<nint> LessThan(Vector<nint> left, Vector<nint> right);
public static Vector<nuint> LessThan(Vector<nuint> left, Vector<nuint> right);
public static Vector<nint> LessThanOrEqual(Vector<nint> left, Vector<nint> right);
public static Vector<nuint> LessThanOrEqual(Vector<nuint> left, Vector<nuint> right);
}
}
Other Considerations
For API names, we have a guideline that states to use the framework name rather than the language keyword name (e.g. Int32 and Int64 rather than int and long). However, the framework name for nint is IntPtr, and the operators exposed and general use case for the two types are somewhat different; as such, names like NInt and NUInt may be a better alternative. Whatever we choose, it should likely become the standard for nint moving forward.
The same request could be made for System.Runtime.Intrinsics, but the API bloat for this would be much larger and would need further consideration. It might be worthwhile to allow nint/nuint as valid T without exposing the additional overloads initially, as that would at least unblock users from providing such APIs themselves.
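For example, if nint were a valid T for Vector128<T>, a user could bridge to the existing fixed-size overloads themselves (a hypothetical sketch; the helper name and the x86-only paths are illustrative):

using System;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

static class Vector128NIntBridge
{
    // Reinterpret the nint elements as the fixed-size type matching the
    // current bitness, then call the overloads that already exist.
    public static Vector128<nint> CompareEqual(Vector128<nint> left, Vector128<nint> right)
    {
        if (IntPtr.Size == 8)
        {
            // 64-bit process: nint elements are 64 bits wide (PCMPEQQ, SSE4.1).
            return Sse41.CompareEqual(left.As<nint, long>(), right.As<nint, long>()).As<long, nint>();
        }

        // 32-bit process: nint elements are 32 bits wide (PCMPEQD, SSE2).
        return Sse2.CompareEqual(left.As<nint, int>(), right.As<nint, int>()).As<int, nint>();
    }
}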
CC. @pgovind, @CarolEidt, @echesakovMSFT
We have an existing use case in the runtime in Utf16Utility.Validation.cs where we are using the following to work around not having this:
#if TARGET_64BIT
using nuint_t = System.UInt64;
#else
using nuint_t = System.UInt32;
#endif
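With the proposal, such code could use the keyword directly and drop the #if entirely (a small sketch; the helper name and mask are illustrative, not the actual Utf16Utility code):

using System.Numerics;

static class Utf16Sketch
{
    // No nuint_t alias needed; the unchecked cast truncates the mask
    // correctly on 32-bit, so the same source works for both bitnesses.
    static bool AnyByteHasHighBitSet(Vector<nuint> value)
    {
        var highBits = new Vector<nuint>(unchecked((nuint)0x8080_8080_8080_8080));
        return (value & highBits) != Vector<nuint>.Zero;
    }
}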
Could you also support reference types? This could be much more powerful (e.g. for collection lookup). Of course, we could not cast other typed vectors to a reference-type vector.
Could you also support reference types?
As I understand it, that would be a non-trivial ask, likely without much benefit. Many of the operations exposed aren't valid or don't make sense for reference types, and the JIT doesn't support tracking GC types in SIMD registers.
There are likely other considerations I'm not aware of as well and which would best be answered by someone like @jkotas or @Maoni0.
I wonder if we could optimistically handle fast collection lookup anyway, even without support for reference types in SIMD. Consider that you have a T[] (where T is a reference type), and you want to quickly look up the index of any given T within the array. Assume referential equality checks, not deep equality checks.
The algorithm would then be as follows:
- Pin the T instance that represents your search target, then project this T to a void*, then to nuint.
- Project the underlying T[] as a Span<nuint>.
- Perform a vectorized search within the Span<nuint> looking for the "target" nuint.
Since the target is pinned, it cannot be moved by the GC, so you don't have to worry about having the GC track individual elements within the SIMD registers. It's possible that the GC might move other elements of the array while you're inspecting it, but that's ok since neither their original addresses nor their modified addresses will match the pinned address of the target you're seeking. This means you're guaranteed zero false positives and zero false negatives.
Again, this would only work for referential equality checks. Since I'm not sure how common referential vs. deep equality checks are, I'm not sure how much benefit this would offer in practice.
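A rough sketch of that algorithm, assuming Vector<nuint> from this proposal and .NET 5's MemoryMarshal.GetArrayDataReference (note that GCHandleType.Pinned only succeeds for objects the runtime considers pinnable, so treat this as illustrative rather than production-ready):

using System;
using System.Numerics;
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;

static class ReferenceSearch
{
    public static int IndexOfReference<T>(T[] array, T target) where T : class
    {
        GCHandle handle = GCHandle.Alloc(target, GCHandleType.Pinned);
        try
        {
            // Read the (now immovable) object reference's bits as a nuint.
            nuint targetBits = Unsafe.As<T, nuint>(ref target);

            // View the array's object references as native-sized integers.
            ref nuint first = ref Unsafe.As<T, nuint>(ref MemoryMarshal.GetArrayDataReference(array));
            Span<nuint> bits = MemoryMarshal.CreateSpan(ref first, array.Length);

            var targetVector = new Vector<nuint>(targetBits);
            int i = 0;
            for (; i + Vector<nuint>.Count <= bits.Length; i += Vector<nuint>.Count)
            {
                var chunk = new Vector<nuint>(bits.Slice(i, Vector<nuint>.Count));
                if (Vector.EqualsAny(chunk, targetVector))
                    break; // the scalar tail below finds the exact index
            }

            for (; i < bits.Length; i++)
            {
                if (bits[i] == targetBits)
                    return i;
            }

            return -1;
        }
        finally
        {
            handle.Free();
        }
    }
}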
BTW, maybe we also want support for enum.
- Pin the T instance that represents your search target, then project this T to a void*, then to nuint.
Maybe we need some compiler-specific support or a JIT intrinsic to do this effectively. Such as:
IntPtr e0, e1, e2, e3; // fixed (pinned) locals acting as pointers
Unsafe.WriteUnaligned(ref Unsafe.As<IntPtr, byte>(ref e0), Unsafe.ReadUnaligned<Vector<nint>>(ref Unsafe.As<T, byte>(ref array[0])));
var vector = Unsafe.ReadUnaligned<Vector<nint>>(ref e0);
C# is going to have the record type feature, so the compiler can provide detailed, structured information about a record's equality. If Vector<object> is supported (or the workaround with nint works), we could compare deep equality via SIMD.
This is pseudo code.
[EquatabilityContract]
[CompilerGenerated]
class Record
{
    [EquatabilityMember]
    long id;
}

Record[] records;

Record Find(long id)
{
    for (var i = 0; i + Vector256<object>.Count < records.Length; i += Vector256<object>.Count)
    {
        var vector = Unsafe.ReadUnaligned<Vector256<object>>(ref Unsafe.As<Record, byte>(ref records[i]));
        // "tokenof" means a member definition reference in IL; the compiler could generate it.
        var ids = Avx2.GatherVector256<long>(JitHelper.OffsetOf<Record>(tokenof(id)), vector);
        var compare = Avx2.CompareEqual(ids, Vector256.Create(id));
        var mm = Avx2.MoveMask(compare.AsByte());
        if (mm != 0)
        {
            // First matching byte -> element index within the vector.
            return Unsafe.As<Record>(vector.GetElement(BitOperations.TrailingZeroCount(mm) / sizeof(long)));
        }
    }
    return null;
}
- We should use NInt and NUInt as the non-keyword type names for APIs that have to refer to nint and nuint, as opposed to IntPtr and UIntPtr.
namespace System.Numerics
{
public partial struct Vector<T>
{
public static explicit operator Vector<nint>(Vector<T> value);
public static explicit operator Vector<nuint>(Vector<T> value);
}
public static partial class Vector
{
public static Vector<nint> AsVectorNInt<T>(Vector<T> value);
public static Vector<nuint> AsVectorNUInt<T>(Vector<T> value);
public static Vector<nint> Equals(Vector<nint> left, Vector<nint> right);
public static Vector<nuint> Equals(Vector<nuint> left, Vector<nuint> right);
public static Vector<nint> GreaterThan(Vector<nint> left, Vector<nint> right);
public static Vector<nuint> GreaterThan(Vector<nuint> left, Vector<nuint> right);
public static Vector<nint> GreaterThanOrEqual(Vector<nint> left, Vector<nint> right);
public static Vector<nuint> GreaterThanOrEqual(Vector<nuint> left, Vector<nuint> right);
public static Vector<nint> LessThan(Vector<nint> left, Vector<nint> right);
public static Vector<nuint> LessThan(Vector<nuint> left, Vector<nuint> right);
public static Vector<nint> LessThanOrEqual(Vector<nint> left, Vector<nint> right);
public static Vector<nuint> LessThanOrEqual(Vector<nuint> left, Vector<nuint> right);
}
}
@jkotas, @CarolEidt: How important is maintaining perf of Vector<T> for platforms without acceleration (such as ARM32)?
Today, much of the code is generated via T4 templates and internally uses the Register struct, which is explicit layout with overlapping data.
However, given the existence of System.Runtime.CompilerServices.Unsafe, we could remove the T4 templates and rewrite the software fallback logic to be generic loops instead, which would greatly simplify the code and make it easier to insert the nint and nuint support, which doesn't work as nicely with the T4 setup. A sketch of this shape follows below.
- The T4 templates also rarely work with the current S.P.Corelib setup, so the benefit is basically non-existent today.
This would also remove duplication between the IsHardwareAccelerated and fallback paths, where the former uses for loops and the latter is manually unrolled.
- The IsHardwareAccelerated path is relying on the basic loop unrolling support that exists for Vector<T>.Count, which we could enable for ARM32 if perf is a concern.
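For concreteness, the rewritten fallback could look roughly like this (a sketch under the assumptions above, with hypothetical helper names, not the actual corelib change):

using System;
using System.Numerics;
using System.Runtime.CompilerServices;

static class VectorSoftwareFallback
{
    // Per-element addition over the vector, replacing the T4-generated,
    // manually unrolled variants with a single generic loop.
    public static Vector<T> Add<T>(Vector<T> left, Vector<T> right) where T : struct
    {
        Unsafe.SkipInit(out Vector<T> result);

        for (int i = 0; i < Vector<T>.Count; i++)
        {
            T value = ScalarAdd(GetElement(ref left, i), GetElement(ref right, i));
            Unsafe.Add(ref Unsafe.As<Vector<T>, T>(ref result), i) = value;
        }

        return result;
    }

    private static T GetElement<T>(ref Vector<T> vector, int index) where T : struct
        => Unsafe.Add(ref Unsafe.As<Vector<T>, T>(ref vector), index);

    // The typeof(T) checks are JIT-time constants for struct T, so each
    // instantiation compiles down to a single branch-free body.
    private static T ScalarAdd<T>(T left, T right) where T : struct
    {
        if (typeof(T) == typeof(int))
            return (T)(object)((int)(object)left + (int)(object)right);
        if (typeof(T) == typeof(long))
            return (T)(object)((long)(object)left + (long)(object)right);
        if (typeof(T) == typeof(nint))
            return (T)(object)((nint)(object)left + (nint)(object)right);
        // ... remaining element types elided ...
        throw new NotSupportedException();
    }
}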
Do you have any numbers for how much regression we would potentially see on ARM32?
It would impact Mono target platforms too. @marek-safar How much do we care about non-accelerated Vector<T> performance on Mono?
Do you have any numbers for how much regression we would potentially see on ARM32?
I don't have any currently, but I can try and get some.
I imagine some parts would be improved, especially in user code, due to less data being zeroed (it's currently a 16-byte, 66-field explicit-layout struct). But some parts, namely the methods in S.P.Corelib, would likely regress due to running a for loop rather than a manually unrolled loop (although several functions aren't unrolled today).
How much do we care about non-accelerated Vector performance on Mono?
@jkotas we don't care about performance for that config
Do you have any numbers for how much regression we would potentially see on ARM32?
@jkotas, it looks like the worst case is about a 4x* perf regression (for byte, which has 16 elements). This only looks to be the case on ARM32, where the loop unrolling support around Vector<T>.Count doesn't exist.
- Some, like LessThanOrEqualAll, are actually closer to 6x, but this is because it does LessThan || Equals rather than doing an efficient single pass (see the sketch below).
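The difference between the two shapes looks roughly like this (illustrative, not the actual corelib source):

using System.Numerics;

static class LessThanOrEqualShapes
{
    // Today's shape: two full element-wise passes (LessThan, then Equals),
    // OR'd together, followed by an all-bits-set check.
    static bool TwoPass(Vector<int> left, Vector<int> right)
    {
        Vector<int> mask = Vector.LessThan(left, right) | Vector.Equals(left, right);
        return Vector.EqualsAll(mask, Vector.Equals(left, left)); // all-ones vector
    }

    // An efficient single-pass shape: one comparison per element, early exit.
    static bool SinglePass(Vector<int> left, Vector<int> right)
    {
        for (int i = 0; i < Vector<int>.Count; i++)
        {
            if (left[i] > right[i])
            {
                return false;
            }
        }
        return true;
    }
}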
BenchmarkDotNet=v0.12.1.1405-nightly, OS=Windows 10.0.19041.508 (2004/May2020Update/20H1)
Microsoft SQ1 3.0 GHz, 1 CPU, 8 logical and 8 physical cores
Current:
Method | Mean | Error | StdDev | Median | Min | Max | Gen 0 | Gen 1 | Gen 2 | Allocated |
---|---|---|---|---|---|---|---|---|---|---|
CountBenchmark | 0.0000 ns | 0.0000 ns | 0.0000 ns | 0.0000 ns | 0.0000 ns | 0.0000 ns | - | - | - | - |
OneBenchmark | 11.6882 ns | 0.0298 ns | 0.0264 ns | 11.6883 ns | 11.6550 ns | 11.7447 ns | - | - | - | - |
ZeroBenchmark | 0.0000 ns | 0.0000 ns | 0.0000 ns | 0.0000 ns | 0.0000 ns | 0.0000 ns | - | - | - | - |
EqualsBenchmark | 3.7958 ns | 0.0291 ns | 0.0258 ns | 3.8016 ns | 3.7472 ns | 3.8370 ns | - | - | - | - |
GetHashCodeBenchmark | 99.3895 ns | 1.4870 ns | 1.2417 ns | 99.2845 ns | 97.4959 ns | 101.7576 ns | - | - | - | - |
AddOperatorBenchmark | 14.7578 ns | 0.0843 ns | 0.0748 ns | 14.7397 ns | 14.6672 ns | 14.9465 ns | - | - | - | - |
BitwiseAndOperatorBenchmark | 12.2149 ns | 0.0902 ns | 0.0800 ns | 12.2393 ns | 12.0871 ns | 12.3594 ns | - | - | - | - |
BitwiseOrOperatorBenchmark | 9.1529 ns | 0.0340 ns | 0.0302 ns | 9.1471 ns | 9.1069 ns | 9.2057 ns | - | - | - | - |
DivisionOperatorBenchmark | 79.2174 ns | 0.0941 ns | 0.0786 ns | 79.1944 ns | 79.1173 ns | 79.3931 ns | - | - | - | - |
EqualityOperatorBenchmark | 0.7216 ns | 0.0280 ns | 0.0248 ns | 0.7208 ns | 0.6867 ns | 0.7715 ns | - | - | - | - |
ExclusiveOrOperatorBenchmark | 9.1752 ns | 0.0468 ns | 0.0438 ns | 9.1527 ns | 9.1113 ns | 9.2425 ns | - | - | - | - |
InequalityOperatorBenchmark | 3.9465 ns | 0.0683 ns | 0.0605 ns | 3.9228 ns | 3.8851 ns | 4.0766 ns | - | - | - | - |
MultiplyOperatorBenchmark | 15.1731 ns | 0.0715 ns | 0.0633 ns | 15.1714 ns | 15.0914 ns | 15.3134 ns | - | - | - | - |
OnesComplementOperatorBenchmark | 21.1975 ns | 0.0804 ns | 0.0671 ns | 21.2021 ns | 21.0049 ns | 21.2608 ns | - | - | - | - |
SubtractionOperatorBenchmark | 15.2344 ns | 0.0691 ns | 0.0613 ns | 15.2315 ns | 15.1622 ns | 15.3523 ns | - | - | - | - |
UnaryNegateOperatorBenchmark | 17.3482 ns | 0.0274 ns | 0.0243 ns | 17.3372 ns | 17.3272 ns | 17.4067 ns | - | - | - | - |
AbsBenchmark | 4.8659 ns | 0.0293 ns | 0.0259 ns | 4.8537 ns | 4.8366 ns | 4.9118 ns | - | - | - | - |
AddBenchmark | 18.1764 ns | 0.0569 ns | 0.0504 ns | 18.1810 ns | 18.1020 ns | 18.2635 ns | - | - | - | - |
AndNotBenchmark | 29.3541 ns | 0.1666 ns | 0.1558 ns | 29.3400 ns | 29.1411 ns | 29.6693 ns | - | - | - | - |
BitwiseAndBenchmark | 10.0123 ns | 0.0400 ns | 0.0355 ns | 10.0063 ns | 9.9612 ns | 10.0709 ns | - | - | - | - |
BitwiseOrBenchmark | 10.0021 ns | 0.1060 ns | 0.0991 ns | 9.9950 ns | 9.8593 ns | 10.1830 ns | - | - | - | - |
ConditionalSelectBenchmark | 49.0080 ns | 0.3078 ns | 0.2879 ns | 49.0135 ns | 48.4501 ns | 49.5042 ns | - | - | - | - |
DivideBenchmark | 78.8175 ns | 0.2677 ns | 0.2373 ns | 78.7779 ns | 78.4765 ns | 79.2487 ns | - | - | - | - |
DotBenchmark | 15.2722 ns | 0.0261 ns | 0.0231 ns | 15.2709 ns | 15.2363 ns | 15.3179 ns | - | - | - | - |
EqualsStaticBenchmark | 22.6560 ns | 0.1268 ns | 0.1186 ns | 22.6159 ns | 22.5144 ns | 22.9051 ns | - | - | - | - |
EqualsAllBenchmark | 4.1479 ns | 0.0527 ns | 0.0467 ns | 4.1447 ns | 4.0745 ns | 4.2333 ns | - | - | - | - |
EqualsAnyBenchmark | 29.4765 ns | 0.0926 ns | 0.0866 ns | 29.4362 ns | 29.3856 ns | 29.6265 ns | - | - | - | - |
GreaterThanBenchmark | 21.7368 ns | 0.0631 ns | 0.0559 ns | 21.7294 ns | 21.6671 ns | 21.8384 ns | - | - | - | - |
GreaterThanAllBenchmark | 32.9013 ns | 0.0527 ns | 0.0493 ns | 32.9126 ns | 32.8253 ns | 32.9649 ns | - | - | - | - |
GreaterThanAnyBenchmark | 28.0047 ns | 0.0764 ns | 0.0714 ns | 27.9833 ns | 27.9141 ns | 28.1723 ns | - | - | - | - |
GreaterThanOrEqualBenchmark | 52.3921 ns | 0.3169 ns | 0.2646 ns | 52.2974 ns | 52.1671 ns | 53.0578 ns | - | - | - | - |
GreaterThanOrEqualAllBenchmark | 60.9816 ns | 0.1275 ns | 0.1065 ns | 60.9825 ns | 60.7853 ns | 61.2152 ns | - | - | - | - |
GreaterThanOrEqualAnyBenchmark | 58.1081 ns | 0.1660 ns | 0.1386 ns | 58.0799 ns | 57.8858 ns | 58.3585 ns | - | - | - | - |
LessThanBenchmark | 20.2578 ns | 0.1129 ns | 0.1000 ns | 20.2367 ns | 20.0859 ns | 20.4389 ns | - | - | - | - |
LessThanAllBenchmark | 33.3310 ns | 0.1187 ns | 0.0991 ns | 33.3558 ns | 33.1460 ns | 33.5284 ns | - | - | - | - |
LessThanAnyBenchmark | 25.5108 ns | 0.0980 ns | 0.0869 ns | 25.5065 ns | 25.3958 ns | 25.6639 ns | - | - | - | - |
LessThanOrEqualBenchmark | 49.8892 ns | 0.1327 ns | 0.1176 ns | 49.8391 ns | 49.7525 ns | 50.1483 ns | - | - | - | - |
LessThanOrEqualAllBenchmark | 60.0621 ns | 0.1442 ns | 0.1349 ns | 60.0328 ns | 59.8471 ns | 60.2880 ns | - | - | - | - |
LessThanOrEqualAnyBenchmark | 55.2470 ns | 0.0496 ns | 0.0388 ns | 55.2529 ns | 55.1691 ns | 55.3128 ns | - | - | - | - |
MaxBenchmark | 21.8661 ns | 0.0363 ns | 0.0322 ns | 21.8606 ns | 21.8211 ns | 21.9399 ns | - | - | - | - |
MinBenchmark | 21.0789 ns | 0.0779 ns | 0.0691 ns | 21.0662 ns | 20.9944 ns | 21.2350 ns | - | - | - | - |
MultiplyBenchmark | 18.0297 ns | 0.0527 ns | 0.0467 ns | 18.0289 ns | 17.9634 ns | 18.1398 ns | - | - | - | - |
NegateBenchmark | 18.5310 ns | 0.0626 ns | 0.0555 ns | 18.5090 ns | 18.4585 ns | 18.6348 ns | - | - | - | - |
OnesComplementBenchmark | 20.7130 ns | 0.1875 ns | 0.1754 ns | 20.7374 ns | 20.3737 ns | 20.9449 ns | - | - | - | - |
SquareRootBenchmark | 34.9111 ns | 0.0887 ns | 0.0786 ns | 34.8872 ns | 34.8279 ns | 35.0818 ns | - | - | - | - |
SubtractBenchmark | 18.0638 ns | 0.0379 ns | 0.0336 ns | 18.0603 ns | 17.9950 ns | 18.1135 ns | - | - | - | - |
XorBenchmark | 10.2635 ns | 0.1394 ns | 0.1304 ns | 10.2707 ns | 10.0667 ns | 10.5332 ns | - | - | - | - |
Generic Unsafe implementation:
// * Summary *
BenchmarkDotNet=v0.12.1.1405-nightly, OS=Windows 10.0.19041.508 (2004/May2020Update/20H1)
Microsoft SQ1 3.0 GHz, 1 CPU, 8 logical and 8 physical cores
.NET Core SDK=6.0.100-alpha.1.20459.9
[Host] : .NET Core 6.0.0 (CoreCLR 6.0.20.45424, CoreFX 6.0.20.45424), Arm RyuJIT
Job-XVEAND : .NET Core 5.0 (CoreCLR 42.42.42.42424, CoreFX 42.42.42.42424), Arm RyuJIT
PowerPlanMode=00000000-0000-0000-0000-000000000000 Arguments=/p:DebugType=portable Toolchain=CoreRun
IterationTime=250.0000 ms MaxIterationCount=20 MinIterationCount=15
WarmupCount=1
Method | Mean | Error | StdDev | Median | Min | Max | Gen 0 | Gen 1 | Gen 2 | Allocated |
---|---|---|---|---|---|---|---|---|---|---|
CountBenchmark | 0.6085 ns | 0.0292 ns | 0.0259 ns | 0.6163 ns | 0.5516 ns | 0.6352 ns | - | - | - | - |
OneBenchmark | 15.0520 ns | 0.0734 ns | 0.0687 ns | 15.0311 ns | 14.9681 ns | 15.1906 ns | - | - | - | - |
ZeroBenchmark | 0.7316 ns | 0.1004 ns | 0.1156 ns | 0.7163 ns | 0.5323 ns | 0.9691 ns | - | - | - | - |
EqualsBenchmark | 4.0958 ns | 0.0461 ns | 0.0408 ns | 4.0859 ns | 4.0474 ns | 4.1932 ns | - | - | - | - |
GetHashCodeBenchmark | 308.9952 ns | 2.5950 ns | 2.3004 ns | 308.3442 ns | 306.2050 ns | 314.2366 ns | - | - | - | - |
AddOperatorBenchmark | 61.4049 ns | 0.0881 ns | 0.0781 ns | 61.3829 ns | 61.2802 ns | 61.5378 ns | - | - | - | - |
BitwiseAndOperatorBenchmark | 8.7401 ns | 0.0907 ns | 0.0804 ns | 8.7529 ns | 8.4814 ns | 8.8201 ns | - | - | - | - |
BitwiseOrOperatorBenchmark | 5.8710 ns | 0.0195 ns | 0.0163 ns | 5.8664 ns | 5.8459 ns | 5.8993 ns | - | - | - | - |
DivisionOperatorBenchmark | 243.7459 ns | 6.7311 ns | 7.7516 ns | 244.9533 ns | 227.9078 ns | 257.7710 ns | - | - | - | - |
EqualityOperatorBenchmark | 0.9734 ns | 0.0154 ns | 0.0136 ns | 0.9692 ns | 0.9549 ns | 1.0001 ns | - | - | - | - |
ExclusiveOrOperatorBenchmark | 5.9003 ns | 0.0257 ns | 0.0228 ns | 5.8996 ns | 5.8748 ns | 5.9447 ns | - | - | - | - |
InequalityOperatorBenchmark | 3.5203 ns | 0.0322 ns | 0.0285 ns | 3.5061 ns | 3.4928 ns | 3.5887 ns | - | - | - | - |
MultiplyOperatorBenchmark | 61.9186 ns | 0.1388 ns | 0.1159 ns | 61.9120 ns | 61.7196 ns | 62.1257 ns | - | - | - | - |
OnesComplementOperatorBenchmark | 21.6788 ns | 0.1246 ns | 0.1105 ns | 21.6766 ns | 21.5047 ns | 21.8764 ns | - | - | - | - |
SubtractionOperatorBenchmark | 60.1066 ns | 0.2208 ns | 0.1844 ns | 60.1600 ns | 59.8257 ns | 60.3687 ns | - | - | - | - |
UnaryNegateOperatorBenchmark | 67.4316 ns | 0.2463 ns | 0.2183 ns | 67.3973 ns | 67.1599 ns | 67.8942 ns | - | - | - | - |
AbsBenchmark | 10.3173 ns | 0.1574 ns | 0.1472 ns | 10.2980 ns | 10.0591 ns | 10.5487 ns | - | - | - | - |
AddBenchmark | 65.3602 ns | 0.1875 ns | 0.1662 ns | 65.3597 ns | 65.0336 ns | 65.6387 ns | - | - | - | - |
AndNotBenchmark | 22.8476 ns | 0.0983 ns | 0.0872 ns | 22.8684 ns | 22.6440 ns | 22.9391 ns | - | - | - | - |
BitwiseAndBenchmark | 10.1515 ns | 0.1625 ns | 0.1520 ns | 10.1560 ns | 9.8568 ns | 10.3946 ns | - | - | - | - |
BitwiseOrBenchmark | 8.7920 ns | 0.0685 ns | 0.0641 ns | 8.7941 ns | 8.6943 ns | 8.9188 ns | - | - | - | - |
ConditionalSelectBenchmark | 35.1013 ns | 0.0871 ns | 0.0814 ns | 35.0999 ns | 34.9701 ns | 35.2477 ns | - | - | - | - |
DivideBenchmark | 230.8056 ns | 3.8200 ns | 3.5733 ns | 231.1309 ns | 223.5282 ns | 236.5047 ns | - | - | - | - |
DotBenchmark | 106.8848 ns | 1.5138 ns | 1.4160 ns | 106.6555 ns | 104.8663 ns | 109.8214 ns | - | - | - | - |
EqualsStaticBenchmark | 38.7218 ns | 0.1433 ns | 0.1341 ns | 38.6835 ns | 38.5266 ns | 38.9829 ns | - | - | - | - |
EqualsAllBenchmark | 3.5746 ns | 0.0356 ns | 0.0316 ns | 3.5639 ns | 3.5331 ns | 3.6283 ns | - | - | - | - |
EqualsAnyBenchmark | 57.1994 ns | 0.2407 ns | 0.2134 ns | 57.1643 ns | 56.9201 ns | 57.6293 ns | - | - | - | - |
GreaterThanBenchmark | 36.8960 ns | 0.0887 ns | 0.0786 ns | 36.8852 ns | 36.7171 ns | 37.0227 ns | - | - | - | - |
GreaterThanAllBenchmark | 105.6515 ns | 0.7164 ns | 0.6351 ns | 105.6380 ns | 104.6500 ns | 106.7682 ns | - | - | - | - |
GreaterThanAnyBenchmark | 94.5897 ns | 1.4236 ns | 1.3316 ns | 94.5560 ns | 92.3444 ns | 97.3495 ns | - | - | - | - |
GreaterThanOrEqualBenchmark | 144.2161 ns | 0.3195 ns | 0.2832 ns | 144.1130 ns | 143.8592 ns | 144.8957 ns | - | - | - | - |
GreaterThanOrEqualAllBenchmark | 225.7534 ns | 0.7733 ns | 0.6457 ns | 225.8675 ns | 224.5361 ns | 226.7384 ns | - | - | - | - |
GreaterThanOrEqualAnyBenchmark | 214.1536 ns | 1.0426 ns | 0.9752 ns | 213.6999 ns | 213.0955 ns | 216.1802 ns | - | - | - | - |
LessThanBenchmark | 34.9333 ns | 0.1168 ns | 0.1036 ns | 34.9611 ns | 34.7754 ns | 35.1236 ns | - | - | - | - |
LessThanAllBenchmark | 162.8080 ns | 1.7153 ns | 1.6045 ns | 162.2848 ns | 160.3714 ns | 165.8838 ns | - | - | - | - |
LessThanAnyBenchmark | 44.2883 ns | 0.4552 ns | 0.4258 ns | 44.3683 ns | 43.1221 ns | 44.9381 ns | - | - | - | - |
LessThanOrEqualBenchmark | 183.8210 ns | 0.4592 ns | 0.4071 ns | 183.8070 ns | 183.3386 ns | 184.5908 ns | - | - | - | - |
LessThanOrEqualAllBenchmark | 401.9115 ns | 8.9488 ns | 10.3055 ns | 402.8547 ns | 384.7896 ns | 423.1362 ns | - | - | - | - |
LessThanOrEqualAnyBenchmark | 265.0211 ns | 3.5067 ns | 3.2801 ns | 264.9297 ns | 260.6478 ns | 270.4116 ns | - | - | - | - |
MaxBenchmark | 40.3965 ns | 0.1073 ns | 0.1004 ns | 40.4098 ns | 40.1970 ns | 40.5682 ns | - | - | - | - |
MinBenchmark | 39.0784 ns | 0.1081 ns | 0.0959 ns | 39.0810 ns | 38.9268 ns | 39.2049 ns | - | - | - | - |
MultiplyBenchmark | 63.0806 ns | 0.2723 ns | 0.2547 ns | 63.0654 ns | 62.6993 ns | 63.6056 ns | - | - | - | - |
NegateBenchmark | 63.1924 ns | 0.1658 ns | 0.1550 ns | 63.2034 ns | 62.9797 ns | 63.5404 ns | - | - | - | - |
OnesComplementBenchmark | 22.2950 ns | 0.1434 ns | 0.1271 ns | 22.3171 ns | 21.9820 ns | 22.4744 ns | - | - | - | - |
SquareRootBenchmark | 129.6186 ns | 0.5511 ns | 0.4885 ns | 129.5628 ns | 128.9000 ns | 130.5883 ns | - | - | - | - |
SubtractBenchmark | 66.1447 ns | 0.3725 ns | 0.3484 ns | 66.1251 ns | 65.3131 ns | 66.5934 ns | - | - | - | - |
XorBenchmark | 9.4515 ns | 0.1190 ns | 0.1113 ns | 9.4744 ns | 9.2758 ns | 9.6036 ns | - | - | - | - |
Thanks for collecting the data. I am supportive of your proposal to simplify the template.
I think that removing the T4 templates is the right thing to do, but it also seems that it would be worth what might be a smallish investment to reduce the perf impact for arm32.
but it also seems that it would be worth what might be a smallish investment to reduce the perf impact for arm32.
I'll take a look and see if adding the loop unrolling support is trivial. From what I recall of doing it for Vector128<T>.Count, it might just be treating Count as intrinsic and ensuring the right flag is set.
@tannergooding - that would be good, but perhaps that could also be done as a "cleanup" PR that includes the work to figure out why the HFA classification requires the Register type.
So we have the SIMDHandlesCache, which is populated based on matching the name of the generic type: https://github.com/dotnet/runtime/blob/master/src/coreclr/src/jit/simd.cpp#L325-L330.
However, we then map that later based on just the simdBaseType:
- https://github.com/dotnet/runtime/blob/master/src/coreclr/src/jit/gentree.cpp#L17354-L17356
- https://github.com/dotnet/runtime/blob/master/src/coreclr/src/jit/compiler.h#L8001-L8002
- etc
However, TYP_I_IMPL and TYP_U_IMPL are themselves merely aliases for TYP_LONG/TYP_INT and TYP_ULONG/TYP_UINT, respectively, so the caching logic breaks down: https://github.com/dotnet/runtime/blob/master/src/coreclr/src/jit/vartype.h#L35-L50
Is there some way to get the handle for Vector<ulong>/Vector<uint> when we encounter a Vector<nuint> so the cache won't be invalid during later lookups, or is there some way to track that a given TYP_ULONG is actually a TYP_U_IMPL (with existing support, that is)?
CC. @CarolEidt, @dotnet/jit-contrib
@jkotas, is there currently a JIT/EE method that can be used to resolve a handle? This would end up being "pay for play" as it would only be resolved once, and only if a Vector<nint> or Vector<nuint> was the used handle.
I was also looking at finishing connecting the m_layout field as an alternative, but there are a number of places where we may synthesize nodes and ultimately create something which doesn't yet have a resolved/cached handle.
Does the getTypeInstantiationArgument method on JIT/EE help?
I'm not really a fan of the SIMDHandlesCache, and I believe we'd be better off extending the ClassLayout to include the SIMD "base type" (instantiation type). Then we could replace the search through the SIMDHandlesCache with a lookup in the ClassLayoutTable.
there are a number of places where we may synthesize nodes and ultimately create something which doesn't yet have a resolved/cached handle
That seems like an issue we need to be able to handle in any event, and I wonder if a synthetic ClassLayout would be the way to address that. I confess I haven't investigated this in detail, but it doesn't seem like there's any other reasonable way to get class info if you don't have a handle (getTypeInstantiationArgument requires a handle).
Does the getTypeInstantiationArgument method on JIT/EE help?
I don't think so. The issue is we are building the SIMDHandleCache by matching the name: https://github.com/dotnet/runtime/blob/master/src/coreclr/src/jit/simd.cpp#L325-L330
So what we have is a Vector`1[System.IntPtr]; however, we only track the simdBaseType in the created node, which will be TYP_I_IMPL, which is just an alias for TYP_LONG.
This means that when we later try and get the handle back, such as in https://github.com/dotnet/runtime/blob/master/src/coreclr/src/jit/gentree.cpp#L17354-L17356, we will try to get a handle for Vector`1[System.Int64], which doesn't exist.
I was hoping there was some way to, for example, resolve the class handle for Vector`1[System.Int64] from a class handle for Vector`1[System.IntPtr]. This would largely solve some of the handle lookup problems we have, both for this case and for cases where we are returning a type that hasn't been resolved (say, for example, we internally create a node that temporarily treats a Vector`1[System.Single] as a Vector`1[System.Int32] -- we do this or similar in a few places).
I'm not really a fan of the SIMDHandlesCache, and I believe we'd be better off extending the ClassLayout to include the SIMD "base type" (instantiation type). Then we could replace the search through the SIMDHandlesCache with a lookup in the ClassLayoutTable.
That seems like an issue we need to be able to handle in any event, and I wonder if a synthetic ClassLayout would be the way to address that.
I think this, in general, probably needs a bit of thought.
We also have a couple of issues around rewriting intrinsics into user calls in lowering (such as to better handle operands that later become constants), in which case I think we want to carry a CORINFO_METHOD_HANDLE instead (or in addition), like GenTreeIntrinsic does: https://github.com/dotnet/runtime/blob/master/src/coreclr/src/jit/gentree.h#L4743-L4748, and with the rewriting: https://github.com/dotnet/runtime/blob/master/src/coreclr/src/jit/rationalize.cpp#L265-L284
I believe you can get the ClassLayout for a given return type or parameter from the method handle, but I think we'd also have the issue where we may not have a method handle (such as for an intrinsic node created to handle a helper method, for example).
The issue is we are building the SIMDHandleCache by matching the name: https://github.com/dotnet/runtime/blob/master/src/coreclr/src/jit/simd.cpp#L325-L330
BTW: This name matching code is sub-optimal. It is using a slow method for name formatting that is meant to be used for debug-only tracing. The TODO at https://github.com/dotnet/runtime/blob/master/src/coreclr/src/jit/simd.cpp#L257 talks about it.
We can add JIT/EE interface methods that let you create new generic instantiations if it helps JIT to represent things. JIT would have to guarantee that these handles are never embedded into the code (directly or indirectly), e.g. the JITed code cannot call these instantiations. Otherwise, it would cause problems for AOT.
BTW: This name matching code is sub-optimal. It is using a slow method for name formatting that is meant to be used for debug-only tracing. The TODO at https://github.com/dotnet/runtime/blob/master/src/coreclr/src/jit/simd.cpp#L257 talks about it.
This is what getTypeInstantiationArgument is for, correct? It looks like we are actually using that for the Vector64/128/256<T> path: https://github.com/dotnet/runtime/blob/master/src/coreclr/src/jit/simd.cpp#L578...
Is it prohibitively expensive if we used it exclusively, or would we still need some caching layer to help?
JIT would have to guarantee that these handles are never embedded into the code (directly or indirectly), e.g. the JITed code cannot call these instantiations
I think that's reasonable and don't believe we are using the handles for anything like that today.
What I'd like to do, ideally, is come to some middle ground between where we are today with the SIMDHandleCache and where we'd like to be with the ClassLayout info. I believe the latter is quite a bit more work, while the former would at least unblock this issue and related scenarios around accelerating Quaternion/Plane (which are functionally a Vector4 for most operations) and implementing the generic Vector2/3/4<T> types.
It sounds like we might be able to do something like:
- Fix up the Vector<T> cache to use getTypeInstantiationArgument, which resolves the name lookup but keeps the mismatch
- Track the CORINFO_TYPE_* rather than the TYP_* for the simdBaseType (it is trivial to go from the former to the latter)
This would allow us to track the proper base type and utilize Vector<T> for nint and nuint. We could then continue the discussions around how to handle ClassLayout and whether we can have some "synthetic" class layout like @CarolEidt suggested.
I got most of the changes done (it rounds out to approximately +812/-722 lines), but ran into a bit of a snag.
In particular, JITtype2varType (https://github.com/dotnet/runtime/blob/master/src/coreclr/src/jit/ee_il_dll.hpp#L152) doesn't preserve the sign of certain types.
For example, while the CORINFO_TYPE_* values for BYTE, UBYTE, SHORT, and USHORT all translate over to the corresponding TYP_*, INT, UINT, LONG, ULONG, NATIVEINT, and NATIVEUINT all go to the signed variants (INT, LONG, I_IMPL).
This isn't hard to handle, but I was wondering if someone could explain the reason for the difference here? At first glance it looks like a bug, but I'm guessing there is some reason for it.
I don't see any surrounding comments, nor comments in https://github.com/dotnet/runtime/blob/master/src/coreclr/src/inc/corinfo.h or https://github.com/dotnet/runtime/blob/master/src/coreclr/src/jit/typelist.h
In particular, JITtype2varType (https://github.com/dotnet/runtime/blob/master/src/coreclr/src/jit/ee_il_dll.hpp#L152) doesn't preserve the sign of certain types. For example, while the CORINFO_TYPE_* values for BYTE, UBYTE, SHORT, and USHORT all translate over to the corresponding TYP_*, INT, UINT, LONG, ULONG, NATIVEINT, and NATIVEUINT all go to the signed variants (INT, LONG, I_IMPL).
Isn't this related to the fact that the CLI models only 3 integral types on the computation stack: int32, native int, and int64? Perhaps you can distinguish between the signed and unsigned variants by checking the flag GTF_UNSIGNED?
Perhaps you can distinguish between the signed and unsigned variants by checking the flag GTF_UNSIGNED?
The base type is distinct from the node type, and while they often match up, there are many cases where they don't.
Isn't this related to the fact that the CLI models only 3 integral types on the computation stack
That was my initial thought, but byte, ubyte, short, and ushort not mapping to int ends up not lining up with that, and much of the JIT logic uses genActualType() to convert a var_types to the stack type where that is relevant.
In particular, JITtype2varType (https://github.com/dotnet/runtime/blob/master/src/coreclr/src/jit/ee_il_dll.hpp#L152) doesn't preserve the sign of certain types.
For example, while the CORINFO_TYPE_* values for BYTE, UBYTE, SHORT, and USHORT all translate over to the corresponding TYP_*, INT, UINT, LONG, ULONG, NATIVEINT, and NATIVEUINT all go to the signed variants (INT, LONG, I_IMPL).
I'm not sure why this is the case. It does seem quite inconsistent, but I would bet that there's JIT code somewhere that depends on it.