Understanding usage of System.Numerics APIs
tannergooding opened this issue · 7 comments
Hey, was pointed at this repo by some other people and just wanted to say looks awesome!
As the owner of the System.Numerics and System.Runtime.Intrinsics APIs on the .NET libraries side of things, it would be great if we could have a sync so I could better understand how you're using things here and any changes or improvements that are needed in the space.
For .NET 8, I added/improved acceleration for Vector2/3/4, added SIMD acceleration for Quaternion/Plane, and rewrote Matrix3x2 and Matrix4x4. This resulted in 8-48x perf improvements in many core scenarios.
We're also looking to add some more APIs to these types to help cover some "missing" functionality, but getting some additional input from real world use cases will help justify the work and ensure it's being prioritized correctly (as well as ensuring any cases we haven't thought of yet are tracked).
Look forward to hearing from you!
Hey, was pointed at this repo by some other people and just wanted to say looks awesome!
Thanks!
As the owner of the System.Numerics and System.Runtime.Intrinsics APIs on the .NET libraries side of things, it would be great if we could have a sync so I could better understand how you're using things here and any changes or improvements that are needed in the space.
Sure! How would you want to do that?
For .NET 8, I added/improved acceleration for Vector2/3/4, added SIMD acceleration for Quaternion/Plane, and rewrote Matrix3x2 and Matrix4x4. This resulted in 8-48x perf improvements in many core scenarios.
Shoot, I get distracted by AI stuff for a few months and you're already this far into 8's cycle :P Sounds like I've got some catching up to do. I suspect there are quite a few legacy-informed decisions I'll need to revisit (in a good delete-heavy way).
@tannergooding! I saw your work but forgot to ask you or check lately. Glad you got pointed here, because this did come up.
I was going to ask for this: for Vector2/3/4, Matrix3x2, and those kinds of physics- and graphics-related real-world types to have SIMD intrinsics. I believe Julia and Swift might have it, but I'm absolutely hooked on .NET 7 and the JIT. Unity has some more homogeneous vector and matrix types, and some have special extensions for Vector2i and Vector2long, but ideally it would be T with generalized SIMD, like what has gone nicely into Vector128, which is amazing progress, letting pages of old code just get erased :) So even if not in 3D, maybe in 2D, because that is what maps to pixels.
It should IMO be generalized in the core math. It is practical but admittedly still niche, though trending up in use cases, and maybe out of the scope of bepu near term. Also, as Ross said, it's late in the cycle for both, I guess, but if .NET 9... or something to keep in mind, or if it has already been requested enough or is implementable without huge impact...
I'm also sidetracked
And burned out by the AI copilot, so skip the following long rant, but please, I suggest putting eyes on Julia's math. But here's some more pitch, for countables and conversion.
Never mind persistence and machine agreeability on IEEE float types; in integration it's important in physics sims. So limitations of floats, which most people agree are most convenient, can cause some problems, one being machines don't agree on them and they can't be represented exactly. Real numbers aren't real or physical, one might even say.
https://www.josstam.com/reversibl
This kind of pendulum sim can gain energy from FP errors and rounding and explode with 5 balls in a few steps, even running forward. And with his demo in JS, you can easily see how perfectly reversible it is, or not, and that is just one machine.
https://arxiv.org/abs/2207.07695 If you read none of this: Jos Stam has been a graphics and physics pioneer since 1999 and doesn't publish often. He states it well.
But here are the additional cases I see:
Julia is now a favorite among physics and fluid modelers, and some other languages also support rational numbers and fractional math as well.
Simulation with Eulerian grids: fluids like semi-Lagrangian advection (Stam 99) on grids instead of the Lagrangian method (fields). If it were in .NET in general, I think it would help adoption in the physics, CFD, and DSP communities as well as games. Some have attempted to unify Julia a bit with .NET, but without equivalent types it's hard... or maybe academics can use .NET and C# or F# instead, because old C code ports to it easily.
Quantum gravity theorists use the types. Cryptography too. Floating-point errors have crashed rockets already.
While space games with large coordinate systems often do local frames via floats for either local physics or as a workaround, the universe state is best stored as ulong or even bigint. But there are good reasons to use Int32 / fixed / posit / unum homogeneous math for integration steps. In large worlds ulongs are often used; the floats are taken from the data for convenience, but the data is kept as countables, even for sparse fields.
The size of the word / the position gives a coordinate.
So in summary, besides the other game frameworks having it, as well as CAD, there is a huge cry for 3D Studio Max to store doubles because models drift after multiple edits. In 1981 AutoCAD's visionary founder John Walker decided on doubles at the database level (floats at the display list level), or civil-scale designs would have drifted after enough edits.
1. Determinism and reversibility, for multiplayer, for energy conservation, for Eulerian fluid math, and for distributed computing. DSP and fixed-point math are standard for this stuff. The issues with IEEE floats are just unsolvable and doubles are generally too big and the issues remain.
Other problems are convergence of integrators, substepping, and having to use memory and take snapshots if substepping too far. With reversible integrators you just use -dt and back up.
The problem is that floating-point numbers can easily go chaotic with hard constraints and are not reversible or deterministic among machines. So we damp simple discretized wave oscillators like a height field with 0.0000001 or they explode. So floats or fixed are taken from the data for convenience, but the data is kept and persists as whole numbers. https://en.wikipedia.org/wiki/Finite_field So it is my wish that more of academia start to use .NET.
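To make the "reversible integrators, just use -dt and back up" idea concrete, here is a minimal sketch of an integrator whose state is kept entirely in integers. The type name, the `Force` function, and the divisor 16 are all made up for illustration; the only point is that when every step is an exact integer update, the backward step undoes the forward step bit for bit.

```csharp
public static class ReversibleOscillator
{
    // Hypothetical restoring "force", computed purely in integer arithmetic.
    private static long Force(long x) => x / 16;

    // Forward step: update position first, then velocity from the NEW position.
    public static (long X, long V) StepForward((long X, long V) s)
    {
        long x = s.X + s.V;
        long v = s.V - Force(x);
        return (x, v);
    }

    // Backward step: undo the velocity update, then undo the position update.
    public static (long X, long V) StepBackward((long X, long V) s)
    {
        long v = s.V + Force(s.X);
        long x = s.X - v;
        return (x, v);
    }
}
```

Running `StepForward` any number of times and then `StepBackward` the same number of times returns the exact starting state, with no snapshots needed. Whether this particular update rule is a good physical model is a separate question that the sketch does not try to answer.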
Sorry about the typos and edits. If homogeneous coordinate systems are accommodated I'm sure i
It's fine. Too big a topic; I've seen the internals, and I have not even brought up polar coordinates. You might ask Stephen Wolfram, he's gone full circle on entropy after 50 years. I'm sure he'll have an opinion. Sorry, I'll stop; I'm clearly mad from all the new stuff to learn.
Not sure if BepuPhysics would benefit from it, but I do miss a Matrix4x3, because it's exactly what's needed to represent a model in 3D space; in a 4x4 matrix the 4th column is usually left as 0,0,0,1. So a 4x3 would mean less bandwidth, and I guess it could be further optimized.
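For readers unfamiliar with the idea, a rough sketch of what such a type could look like follows. `Affine4x3` and its members are hypothetical names invented here; System.Numerics does not provide this type. The sketch assumes the row-vector convention used by `Vector3.Transform(Vector3, Matrix4x4)` and that the dropped fourth column is (0, 0, 0, 1).

```csharp
using System.Numerics;

// Hypothetical type for illustration only; System.Numerics has no Matrix4x3.
public readonly struct Affine4x3
{
    // Three basis rows plus a translation row: 12 floats (48 bytes) instead of 16 (64 bytes).
    public readonly Vector3 Row0, Row1, Row2, Translation;

    public Affine4x3(Vector3 row0, Vector3 row1, Vector3 row2, Vector3 translation)
    {
        Row0 = row0; Row1 = row1; Row2 = row2; Translation = translation;
    }

    // Drops the fourth column, which is assumed to be (0, 0, 0, 1).
    public static Affine4x3 FromMatrix4x4(in Matrix4x4 m) => new Affine4x3(
        new Vector3(m.M11, m.M12, m.M13),
        new Vector3(m.M21, m.M22, m.M23),
        new Vector3(m.M31, m.M32, m.M33),
        new Vector3(m.M41, m.M42, m.M43));

    // Row-vector convention, matching Vector3.Transform(Vector3, Matrix4x4).
    public Vector3 Transform(Vector3 p) =>
        p.X * Row0 + p.Y * Row1 + p.Z * Row2 + Translation;
}
```

For example, `Affine4x3.FromMatrix4x4(Matrix4x4.CreateTranslation(1, 2, 3)).Transform(new Vector3(1, 1, 1))` gives the same point as `Vector3.Transform` with the full matrix.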
@tannergooding does numeric.vectors have a specific place for discussion? I also use vectors extensively but don't want to steal discussion
Sure! How would you want to do that?
@RossNordby. That's really up to you. We could set up something informal over Teams or Discord, we could just have an async discussion on GitHub here, or something else. In general whatever is easiest on you all.
but ideally it would be T with generalized SIMD, like what has gone nicely into Vector128, which is amazing progress
I've looked at providing a Vector2/3/4<T>, Matrix4x4<T>, etc. We even had an approved API surface, but I "de-approved" it because generic math became a thing before we could get it all implemented and so I want to revisit it and update it to better take advantage of generic math before it ships.
For supporting floating-point T, that's easy. For supporting integer T, that's a bit harder since you then have to consider how things like Length work given they involve a square root (and so at best can round or truncate the result).
There is also the consideration that accelerating float/double or int/uint is easy. But accelerating long/ulong or byte/sbyte is quite a bit harder since the support can be 64-bit limited or won't fill an entire vector.
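To illustrate the Length problem for integer T, here is a small sketch using the generic math interfaces from .NET 7. `Vec2<T>` and both extension methods are hypothetical names invented for this example; they are not the approved or planned API surface mentioned above.

```csharp
using System;
using System.Numerics;

// Hypothetical type for illustration; not an actual or proposed .NET API.
public readonly record struct Vec2<T>(T X, T Y) where T : INumber<T>
{
    // Needs only multiply and add, so it works for every numeric T.
    public T LengthSquared() => X * X + Y * Y;
}

public static class Vec2Extensions
{
    // Only available when T can take a square root (float, double, Half, ...).
    public static T Length<T>(this Vec2<T> v) where T : INumber<T>, IRootFunctions<T>
        => T.Sqrt(v.LengthSquared());

    // For int there is no exact answer in general; this variant truncates.
    public static int LengthTruncated(this Vec2<int> v)
        => (int)Math.Sqrt(v.LengthSquared());
}
```

With this shape, `new Vec2<float>(3f, 4f).Length()` is 5, while `new Vec2<int>(1, 1).Length()` does not compile and `LengthTruncated()` returns 1 for a true length of about 1.414. Whether to round, truncate, or return a floating-point type instead is exactly the kind of design question an integer T raises.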
So limitations of floats, which most people agree are most convenient, can cause some problems, one being machines don't agree on them and they can't be represented exactly. Real numbers aren't real or physical, one might even say.
IEEE 754 floating-point as spec'd is deterministic, if nothing else.
The main issue is that many scenarios, particularly games, compile with features like "fast-math" which tells the compiler "I don't care about deterministic behavior, give me speed instead". The other is that many core math functions (such as sin/cos) aren't "required operations" by the IEEE 754 spec, and so many runtimes allow a little bit of precision loss in favor of speed here. For most cases this is fine, but if you need determinism then it means two different machines can produce different results.
The other issue is that floating-point is an approximation in general, so even if deterministic, there is natural error that is introduced and which must be handled. You therefore cannot think of things in terms of "regular math", but instead must modify the domain to explicitly account for this error. There are many tricks and ways this can be done, including without losing significant performance, but it is in general something that must be accounted for.
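As one small illustration of "accounting for the error", the sketch below accumulates 0.1 repeatedly and then compares with a relative tolerance instead of exact equality. The class name, method names, and the default tolerance of 1e-12 are assumptions chosen for the example, not a recommendation for any particular workload.

```csharp
using System;

public static class FloatError
{
    // Adds 0.1 `count` times; 0.1 has no exact binary representation,
    // so rounding error accumulates in the running sum.
    public static double SumTenths(int count)
    {
        double sum = 0.0;
        for (int i = 0; i < count; i++) sum += 0.1;
        return sum;
    }

    // Relative comparison: equal if the difference is tiny compared with the larger magnitude.
    public static bool NearlyEqual(double a, double b, double relativeTolerance = 1e-12)
        => Math.Abs(a - b) <= relativeTolerance * Math.Max(Math.Abs(a), Math.Abs(b));
}
```

Here `FloatError.SumTenths(10) == 1.0` is false, while `FloatError.NearlyEqual(FloatError.SumTenths(10), 1.0)` is true. The result is still fully deterministic; it is just not the "regular math" answer.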
the issues with IEEE floats are just unsolvable and doubles are generally too big and the issues remain.
They certainly aren't unsolvable. Just requires a little bit of math to understand the limits and fit something into the general model that works.
Using something with more precision, like double, can help. Same as falling back to a general model that works around integers or similar when at scale, but there are also many ways this can be made to work using purely floating-point.
The best approaches tend to be a little bit of each, so you have the right balance between speed and accuracy. Keep in mind that in practice, no one actually works in infinite precision. NASA uses 16 digits for PI and you need something like around 40 digits of PI to compute the circumference of the observable universe to the width of a hydrogen atom, etc.
The main consideration is that of the 4 billion representable 32-bit floating-point values, ~50% of them exist in the domain of -1 to +1 (inclusive). The remaining approx. 50% exist between +1 and float.MaxValue and between -1 and float.MinValue, respectively. There is then a small amount used to represent NaN and subnormal (sometimes called denormal) values. There is also the fact that the upper bound for representing any fractional data is 2^23, and 2^24 for representing accurate integral data.
This leads to several commonly used solutions for handling things both efficiently and correctly to account for the precision/domain limitations without losing perf or overall accuracy.
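The distribution and the 2^23 / 2^24 limits described above can be checked directly from the bit patterns. This is a small sketch; `FloatLayout` and its method names are made up for the example, and it only looks at the non-negative half of the range since the negative half mirrors it.

```csharp
using System;

public static class FloatLayout
{
    // Fraction of non-negative finite floats that lie in [0, 1]. Non-negative floats
    // sort in the same order as their bit patterns, so counting patterns counts values.
    public static double FractionInUnitRange()
    {
        long upToOne = (long)BitConverter.SingleToInt32Bits(1.0f) + 1;           // 0 .. 1.0f
        long upToMax = (long)BitConverter.SingleToInt32Bits(float.MaxValue) + 1; // 0 .. MaxValue
        return (double)upToOne / upToMax;
    }

    // True when adding `step` to `value` is lost to rounding at float precision.
    public static bool IsAbsorbed(float value, float step) => (float)(value + step) == value;
}
```

`FractionInUnitRange()` comes out just under one half. `IsAbsorbed(8388608f, 0.5f)` (2^23 plus a half) and `IsAbsorbed(16777216f, 1f)` (2^24 plus one) are both true, while `IsAbsorbed(4194304f, 0.5f)` and `IsAbsorbed(8388608f, 1f)` are false.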
does numeric.vectors have a specific place for discussion? I also use vectors extensively but don't want to steal discussion
Depends on what the discussion is about. In general we allow discussion threads to be opened on https://github.com/dotnet/runtime/discussions
You can also open up API proposals there as well: https://github.com/dotnet/runtime/issues/new?assignees=&labels=api-suggestion&template=02_api_proposal.yml&title=%5BAPI+Proposal%5D%3A+ - Just keep in mind we have a general process and following the template helps ensure all the information we need for API review exists. If the person opening the proposal doesn't fill it out, then the .NET team has to instead, and that generally means it doesn't get reviewed or considered as quickly.
I am personally also on Discord for both the C# Community and the .NET Evolution servers. The former is for general discussion and the latter is more for working on dotnet/runtime (or other repos) themselves. I spend a good bit of time in each answering questions and generally interacting with the community.
@RossNordby. That's really up to you. We could set up something informal over Teams or Discord, we could just have an async discussion on GitHub here, or something else. In general whatever is easiest on you all.
Alright, suppose I'll start here since async's easy to schedule, and we can hop elsewhere if useful.
What's being used, where, and how it's worked
The biggest use of the numerics APIs by execution time tends to be in constraint solving and narrow phase execution, which look like this:
https://github.com/bepu/bepuphysics2/blob/master/BepuPhysics/Constraints/AngularHinge.cs#L191
https://github.com/bepu/bepuphysics2/blob/master/BepuPhysics/CollisionDetection/CollisionTasks/BoxPairTester.cs#L412
All of these implementations are fed with bundles of packed constraints or collision pairs, and every lane of execution is fully independent.
They're heavily reliant on AoSoA-packed struct types containing Vector<T>, like so:
https://github.com/bepu/bepuphysics2/blob/master/BepuUtilities/Symmetric3x3Wide.cs
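For readers who don't want to click through, the general shape of such a "wide" type is sketched below. `Vec3Wide` is a simplified stand-in written for this note, not a copy of any type in the repository: each field holds one component for `Vector<float>.Count` independent lanes, so one call processes a whole bundle.

```csharp
using System.Numerics;

// Simplified illustrative type; each lane is an independent 3D vector.
public struct Vec3Wide
{
    public Vector<float> X, Y, Z;

    // Copies the same narrow vector into every lane.
    public static Vec3Wide Broadcast(Vector3 v) => new Vec3Wide
    {
        X = new Vector<float>(v.X),
        Y = new Vector<float>(v.Y),
        Z = new Vector<float>(v.Z)
    };

    // One dot product per lane, with no cross-lane operations.
    public static Vector<float> Dot(in Vec3Wide a, in Vec3Wide b)
        => a.X * b.X + a.Y * b.Y + a.Z * b.Z;
}
```

Because nothing here shuffles data between lanes, the same code runs at whatever width the hardware's `Vector<T>` provides.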
A lot of these implementations are a bit old and were built with much older JIT versions in mind, hence the quantity of ref/in and lack of operators.
Historically, between the struct representation and refness causing aliasing difficulties, codegen was often extremely stack shuffley. It's improved hugely (as of my last close look in the 7 previews, probably moreso now), so newer parts of the codebase tend to use far less ref and more operators.
During my last relevant testing in the 6-7 timeframe, I did notice operators and other functions that got inlined still sometimes produced worse instructions than manual inlining, for example:
https://github.com/bepu/bepuphysics2/blob/master/BepuPhysics/Constraints/Weld.cs#L174
(I have not yet revisited these for the latest previews; no idea if this is still relevant.)
Intrinsics have started sneaking their way into parts of the codebase as I revisit things. The gather/scatter used by constraints is a notable case:
https://github.com/bepu/bepuphysics2/blob/master/BepuPhysics/Bodies_GatherScatter.cs#L318
This was built before the cross-platform helpers were added; I did take a quick swing at porting as much over as I could for the sake of accelerating ARM (rather than using the current rather bad fallback) but ran into some difficulties with getting efficient permutes. Unfortunately, this was several months ago, so I've mostly forgotten the details.
There are a number of other places in the codebase that would benefit from similar fast gather/scatter/transpositions, so that's probably going to see more work.
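As a point of reference for the permute discussion, the cross-platform API exposes a general lane permute through `Vector128.Shuffle` (available from .NET 7). The sketch below only shows the API shape; `PermuteSketch` is a made-up name, and whether a given shuffle compiles down to an efficient single instruction depends on the runtime version and the target hardware, which is not something this example demonstrates.

```csharp
using System.Runtime.Intrinsics;

public static class PermuteSketch
{
    // Reverses the four lanes using the cross-platform shuffle helper.
    // Result lane i takes its value from source lane indices[i].
    public static Vector128<float> Reverse(Vector128<float> v)
        => Vector128.Shuffle(v, Vector128.Create(3, 2, 1, 0));
}
```

`PermuteSketch.Reverse(Vector128.Create(1f, 2f, 3f, 4f))` yields the lanes 4, 3, 2, 1.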
I've adopted the cross-platform helpers quite thoroughly... somewhere... but I can't find any big examples in bepuphysics2, so apparently that was in another project. They're good and I like them! I find most new intrinsics-y codechunks I write now are cross platform first with sprinkled platform specific fine-tuning as needed.
There are quite a few uses of the more traditional types like Vector3, but they don't tend to be hot paths at the moment. A lot of those codepaths are lower quality and do some things that don't play well with vectorization, like this branchy, component-accessy raytest:
https://github.com/bepu/bepuphysics2/blob/master/BepuPhysics/Collidables/Cylinder.cs#L76
Some of them are mostly copied from the ancient bepuphysics1 codebase.
I've also accumulated some custom types like QuaternionEx, Matrix3x3, and Matrix. None of these are particularly amazing for performance; they're mostly to cover the historical gaps in vectorization or to provide some missing features. I anticipate that a lot of them are going to get deleted next time I get around to refactoring, given the recent changes.
I also anticipate using the improved non-AoSoA types more, and in more performance sensitive areas, soonishly. There are places where the wide AoSoA representation is not ideal due to extremely high divergence (e.g. hull-hull collisions); providing a fast narrow path will likely be a significant win there. I'd also like to provide certain types of queries (like those mentioned in #150) without batching (because holy moly the CollisionBatcher isn't fun to use) which will also benefit from the narrow types.
As a side note, the interop between Vector2/3/4 and Vector128 is quite handy.
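A tiny example of that interop, assuming .NET 7 or later: reinterpret a `Vector3` as a `Vector128<float>`, use a cross-platform `Vector128` helper, and convert back. `Vector3Interop.Floor` is a hypothetical helper written for this note.

```csharp
using System.Numerics;
using System.Runtime.Intrinsics;

public static class Vector3Interop
{
    // Round each component down by hopping through Vector128<float>.
    // AsVector128 zeroes the unused fourth lane; AsVector3 drops it again.
    public static Vector3 Floor(Vector3 v)
        => Vector128.Floor(v.AsVector128()).AsVector3();
}
```

For instance, `Vector3Interop.Floor(new Vector3(1.5f, -0.5f, 2f))` returns (1, -1, 2).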
I guess in summary, I use/will use most stuff you've stuck in there, and the improvements have been really nice to have!
Additional things that could be useful
Given that so much of the library is based on executing machine-width bundles, AVX512 is a promising feature for the future (assuming Intel gets consistent support). I've seen some of the work towards this so... thumbs up, I suppose!
Otherwise, I find it a bit difficult to point out specific APIs that I want. The discrete features I've usually wanted most are those that unlock new capabilities that weren't feasible before, or which make things systematically easier, as opposed to, say, another helper method. (I'm not opposed to helper methods, of course, but I don't mind implementing them myself :P)
Vector<T>, hardware intrinsics, and cross platform helpers for intrinsics all had this 'unlock' flavor to some significant degree. On the semirecent language side, the unmanaged constraint, static abstracts and function pointers have been extremely helpful in the same way.
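For context on why static abstracts combined with the unmanaged constraint count as an "unlock", here is a generic sketch of the pattern (C# 11). The interface and type names are invented for illustration and are not taken from the library: the operation is chosen by a type parameter, so the JIT can specialize each instantiation without delegates or virtual calls.

```csharp
using System;

// Hypothetical operation contract expressed with static abstract members.
public interface IReduction<T> where T : unmanaged
{
    static abstract T Identity { get; }
    static abstract T Combine(T a, T b);
}

public struct SumInt : IReduction<int>
{
    public static int Identity => 0;
    public static int Combine(int a, int b) => a + b;
}

public struct MaxInt : IReduction<int>
{
    public static int Identity => int.MinValue;
    public static int Combine(int a, int b) => Math.Max(a, b);
}

public static class Reducer
{
    // TOp is never instantiated; it only selects which static methods get called.
    public static T Reduce<T, TOp>(ReadOnlySpan<T> values)
        where T : unmanaged
        where TOp : IReduction<T>
    {
        T result = TOp.Identity;
        foreach (T value in values) result = TOp.Combine(result, value);
        return result;
    }
}
```

`Reducer.Reduce<int, SumInt>(new[] { 1, 2, 3 })` gives 6 and `Reducer.Reduce<int, MaxInt>(new[] { 1, 2, 3 })` gives 3, with the same loop body specialized for each operation.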
I'm not immediately sure what the next most significant unlock would be (apart from the obvious ones like "supporting more instruction sets"). Maybe indirect stuff, like codegen quality- the ability to realistically use operators on largeish custom types was extremely valuable for sanity.
(And, again outside of numerics, I admit I've sometimes wanted memory aliasing hints. I'm not convinced adding such a thing would actually be a good idea, because oof, but I've thought about it.)
That's about it off the top of my head. Might remember some other things later- let me know if you've got any questions!