guzba/nimsimd

Future of Nim-simd

Asc2011 opened this issue · 4 comments

Hello Ryan,

first of all - a BIG THX for this lib. After some video lectures i started with SIMD-stuff and found nimsimd a perfect start. AVX512 is missing, but state-of-the-art AVX2 surely has the broadest audience - even on my side i'm not AVX512 ready, yet :)
But looking at what others achieve with SIMD - looking at Prof. Lemire & friends - one might get the impression, that AVX512 is 'just around the corner' :)
Naturally i started doing wrappers around the horrible intrinsics, trying to improve my and the user-experience. I will do some examples e.g. a sorting-network, string-conversion, base64 en-/decoding, string-lookup etc.
Now i came across Agner Fog's C++ Vector-Class-Library (VCL). And from their description i learned they do sophisticated macro-/template-stuff so that one can use e.g. AVX512, though the actual hardware does not support it. And that made me think - "Shall i try a Nim-wrapper for VCL ?" and then make that lib more accessible thru Nim or continue to work with nim-simd ? Or would it make more sense to do exactly the same with Nim macro-magic => a VCL rewrite (which would be very much above my pay-grade) ?
So what are your plans with nim-simd and maybe you can find the time to take a look at VCL ?
I could use some advice on how to make smth. that encourages more Nim-users to enter the SIMD journey and make this world a more efficient place 8=)).
Fun aside, i believe there is much to gain for Nim in terms of attractiveness for new users being able to try out a technology like SIMD - since Python and even Javascript started adopting it. But maybe even the std-lib could gain here and there ?

beats & greats, Andreas

As a side note : beef already provided a concept & macro to check array-alignment - and he expressed some interest in SIMD-accelerated string lookups...
And i have some good material (papers/videos/gh-code/etc.) in my Zotero, which might fit well in the nim-simd-wiki ?

guzba commented

Hello! I think you may not love my answer here unfortunately. I personally dislike and will basically never use any of the various SIMD hiding wrappers out there so, given that, I don't intend to have nimsimd be anything other than intrinsics.

I would argue that if you take a little time to just learn the intrinsics / instructions and start with some simple things and go up from there, you'll end up having a true understanding of how to best use SIMD and how to lay out the data in your program to best leverage it. I find the wrapper libraries end up basically being DSLs that one learns that don't really end up teaching anyone anything about what's actually going on at the CPU level. Is this knowledge valuable? I think so (and can prove so if performance is the only goal) but certainly this is not true all of the time.

That being said, a library that offers a SIMD-hiding API could certainly use nimsimd for intrinsics internally, it may not be my cup of tea but that's just my personal choice after all.

Re: AVX-512, I just haven't gotten around to adding them as I do not have a CPU that can run them yet anyway. I'm sure I'll grab a Zen 4 chip someday though. My little bit of knowledge of the new instructions makes me very optimistic about how much it will improve writing amd64 SIMD (I found working with the ARM SIMD intrinsics much better).

> Hello! I think you may not love my answer here unfortunately. I personally dislike and will basically never use any of the various SIMD hiding wrappers out there so, given that, I don't intend to have nimsimd be anything other than intrinsics.

ic, so that one can stay as close to 'the metal' as possible. Understood - what about using the GH-WIKI to collect some interesting SIMD-resources ?

> I would argue that if you take a little time to just learn the intrinsics / instructions and start with some simple things and go up from there, you'll end up having a true understanding of how to best use SIMD and how to lay out the data in your program to best leverage it. I find the wrapper libraries end up basically being DSLs that one learns that don't really end up teaching anyone anything about what's actually going on at the CPU level. Is this knowledge valuable? I think so (and can prove so if performance is the only goal) but certainly this is not true all of the time.

Getting anything-SIMD to run comes with a high barrier. That is first-of-all the evergreen of (mis)-naming things in an expressive, precise and concise manner. IMHO that was not a high-prio on the Intel-side. To me it seems to be a funny mix-of-terms from different areas like mathematics, CS, electronics and logic, all nicely mixed up. The NEON-vocabulary - which i have not looked at yet - is said to be a tad more consistent and maybe even richer in features, judging from some SO-questions "In NEON i can do this and that - howto in AVX2" - "you can't :), but here come the workarounds."

TBH either with or without DSL if one is aiming at performance - and surely it's the most attractive feature - one would need good measurement tools and finally have to take a look at the assembly. That's why the SIMD-examples i found are kept very concise - and i like that style.

> That being said, a library that offers a SIMD-hiding API could certainly use nimsimd for intrinsics internally, it may not be my cup of tea but that's just my personal choice after all.

Well since i'm in the hot-learning-phase right now - my idea is to do a supporting SIMD-lib for 'absolute beginners' - like me - that supports the learner in thinking vertically and finding new solutions to problems that are well-solved via scalar-methods. That's the real challenge and it requires tabled-outputs of register-states, in slow-motion, just so as to get an idea what's going on inside the registers.
And consistent naming, too - why are there 5-13 intrinsics for doing the same thing (SSE/AVX/2) ?
This "Adventures in SIMD-Thinking"-video by Bob Steagall describes it well. He needs a 'rotate'-intrinsic - and i thought - sure that will be included, wouldn't it ? - but making a rotation between two registers possible with AVX/2 took me two days in full-swing :)
So there remains the conflict between - staying as simple as possible - no convenience by overloads etc. or enjoying the niceties of e.g. a unified load[M128i, uint16]( ptr ) that works for all ints/floats-known-to-SIMD and checks the pointer-alignment ? I'll try Nim-heavy templating and if it turns out that this manipulates results, then i'd strip-the-code-back-down to an intrinsic-only-version. But during dev one needs these comforts... I'm in the comfortable situation that certain algos that i'm interested in are already well researched, so there are good references around.
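A minimal sketch of the "unified load" idea, written in C rather than Nim because the intrinsic names are the same ones nimsimd exposes. It assumes C11 `_Generic` and SSE2 only; `simd_load` and the `load_*` helpers are hypothetical names, not part of nimsimd or any other library. The selection happens on the pointer's static type at compile time, so no branch is added to the load itself, and no alignment check is performed here.

```c
#include <assert.h>     // static_assert
#include <emmintrin.h>  // SSE2 (also provides the SSE float types)
#include <stdint.h>

static_assert(sizeof(__m128i) == 16, "one SSE register holds 16 bytes");

// Hypothetical helpers: the integer loads all map to the same instruction,
// only the pointer type differs.
static inline __m128i load_u16(const uint16_t *p) { return _mm_loadu_si128((const __m128i *)p); }
static inline __m128i load_i32(const int32_t *p)  { return _mm_loadu_si128((const __m128i *)p); }
static inline __m128  load_f32(const float *p)    { return _mm_loadu_ps(p); }
static inline __m128d load_f64(const double *p)   { return _mm_loadu_pd(p); }

// simd_load(ptr) picks a helper from the static type of ptr at compile time.
#define simd_load(p) _Generic((p),                      \
    const uint16_t *: load_u16, uint16_t *: load_u16,   \
    const int32_t *:  load_i32, int32_t *:  load_i32,   \
    const float *:    load_f32, float *:    load_f32,   \
    const double *:   load_f64, double *:   load_f64)(p)
```

With a `const float *fp` in scope, `simd_load(fp)` expands to `load_f32(fp)` and yields a `__m128`; an unsupported pointer type is rejected at compile time. Whether such a front end helps or hides too much is exactly the trade-off discussed in this thread.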

> Re: AVX-512, I just haven't gotten around to adding them as I do not have a CPU that can run them yet anyway. I'm sure I'll grab a Zen 4 chip someday though. My little bit of knowledge of the new instructions makes me very optimistic about how much it will improve writing amd64 SIMD (I found working with the ARM SIMD intrinsics much better).

ic, ARM has a strong reputation in keeping consistency in an evolving architecture. AFAIK one can still run 8-bit-assembly from the nineties on an AArch64-chip 8=))
AVX512 hasn't arrived on my side either - so no hurries here.
Just 'cos i'm so curious - i'll try to find the time to test a well-studied algo and do it with the VCL-DSL and their AVX512-intrinsics - and see what AVX/2 outcome will surface..

--
Anyhow, have a nice weekend - it's close, best regards, Andreas

guzba commented

Just a little reply to share an example of when the helpers end up not being super helpful:

> a unified load that works for all ints/floats-known-to-SIMD and checks the pointer-alignment ?

This sounds nice but actually is not great. Checking alignment is branching and a branch at every load is quite bad for SIMD performance (and performance is the only reason to use SIMD). Instead, an actual SIMD implementation will do scalar code until aligned, then just do all-aligned SIMD from there. Or it won't care about alignment and do all unaligned which will still be faster than checking for alignment each load.

These little things really do end up mattering and make many helper things that sound nice actually not very useful in practice.
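The "scalar until aligned, then all-aligned SIMD" pattern described above can be sketched as follows. This is an illustration in C with SSE2 only, not code from nimsimd; `sum_u32` is a hypothetical function name, and unsigned arithmetic is used so that overflow simply wraps.

```c
#include <assert.h>
#include <emmintrin.h>  // SSE2
#include <stddef.h>
#include <stdint.h>

// Sums n uint32 values (wrapping on overflow).
uint32_t sum_u32(const uint32_t *data, size_t n) {
    assert(data != NULL || n == 0);
    size_t i = 0;
    uint32_t total = 0;

    // Scalar prologue: step until the address is 16-byte aligned.
    while (i < n && ((uintptr_t)(data + i) & 15) != 0) {
        total += data[i++];
    }

    // Aligned body: alignment was established once above, not per load.
    __m128i acc = _mm_setzero_si128();
    for (; i + 4 <= n; i += 4) {
        acc = _mm_add_epi32(acc, _mm_load_si128((const __m128i *)(data + i)));
    }

    // Horizontal add of the four 32-bit lanes.
    uint32_t lanes[4];
    _mm_storeu_si128((__m128i *)lanes, acc);
    total += lanes[0] + lanes[1] + lanes[2] + lanes[3];

    // Scalar tail: fewer than four elements remain.
    for (; i < n; ++i) {
        total += data[i];
    }
    return total;
}
```

The all-unaligned alternative mentioned above would drop the prologue and use `_mm_loadu_si128` in the body; in both variants the hot loop contains no alignment branch.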

> Just a little reply to share an example of when the helpers end up not being super helpful:

> > a unified load that works for all ints/floats-known-to-SIMD and checks the pointer-alignment ?

> This sounds nice but actually is not great. Checking alignment is branching and a branch at every load is quite bad for SIMD performance (and performance is the only reason to use SIMD). Instead, an actual SIMD implementation will do scalar code until aligned, then just do all-aligned SIMD from there. Or it won't care about alignment and do all unaligned which will still be faster than checking for alignment each load.

ic, my idea goes more in the direction of a workbench for developing and testing algorithms and providing as much static-analysis (range-test, hints) as possible. The aligned/unaligned loads/stores will trigger warnings or give hints, not more. But e.g. there are two flavours of unaligned loads, loadu and lddqu, and following the Intel-guide here, one should generally prefer lddqu - so one gets loadu in nim -d:debug and lddqu during nim -d speed or maybe danger. An unaligned load via load segfaults anyway - here i'd just provide a nicer error-msg. And vice-versa maybe a hint in case an unaligned load was logged, but the ptr had proper alignment.
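One way such workbench loads could look, sketched in C with SSE2 only. Everything named `wb_*` and the `SIMD_WORKBENCH` switch are assumptions made up for this illustration. The `lddqu` variant is left out because it needs SSE3; only `_mm_loadu_si128` and `_mm_load_si128` are used. With the switch set to 0 the wrappers reduce to the bare intrinsics, so the checks cost nothing in a speed build.

```c
#include <assert.h>
#include <emmintrin.h>  // SSE2
#include <stdint.h>
#include <stdio.h>

// Hypothetical switch: 1 = workbench build with hints, 0 = plain intrinsics.
#ifndef SIMD_WORKBENCH
#define SIMD_WORKBENCH 1
#endif

// Counts loadu calls whose pointer turned out to be 16-byte aligned.
static unsigned wb_aligned_loadu_hints = 0;

static inline int wb_is_aligned16(const void *p) {
    return ((uintptr_t)p & 15) == 0;
}

// Unaligned load: never faults; in workbench builds it reports missed chances.
static inline __m128i wb_loadu(const void *p) {
#if SIMD_WORKBENCH
    if (wb_is_aligned16(p)) {
        ++wb_aligned_loadu_hints;
        fprintf(stderr, "hint: loadu on a 16-byte aligned pointer %p\n", p);
    }
#endif
    return _mm_loadu_si128((const __m128i *)p);
}

// Aligned load: in workbench builds a misaligned pointer trips an assert
// with a readable message instead of a bare fault.
static inline __m128i wb_load(const void *p) {
#if SIMD_WORKBENCH
    assert(wb_is_aligned16(p) && "wb_load: pointer is not 16-byte aligned");
#endif
    return _mm_load_si128((const __m128i *)p);
}
```

The checks do branch, which is the cost raised earlier in the thread; the sketch only avoids it by compiling the checks out when `SIMD_WORKBENCH` is 0.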

> These little things really do end up mattering and make many helper things that sound nice actually not very useful in practice.

I'd like to make it a two-pass flow. During pass-one everything is in slow-motion and any intrinsic-call can be logged and its effects can be visualized. Any register can be dumped and looked into at any time.
I won't wrap the provided types or change the names of the intrinsics. By logging one can exactly reproduce the functional-flow. And one can calculate the (theoretical) cycle-costs from the intrinsics-guide and the call-log. When i imagine reworking a known algorithm, say a sorting-network, i'd know which performance (cycles) can be expected. Once the expectation (theoretical-cycles) differs from the real-world results, well then there might be smth. conceptually wrong.
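A rough sketch of the "pass-one" idea of dumping registers and logging an intrinsic call, again in C with SSE2 only. `dump_epi32` and `traced_add_epi32` are hypothetical helpers invented for this illustration; the trace shows register contents only and makes no attempt at cycle accounting.

```c
#include <assert.h>
#include <emmintrin.h>  // SSE2
#include <stdint.h>
#include <stdio.h>

// Formats the four 32-bit lanes of v, lowest lane first, into out.
// Returns the number of characters written (excluding the terminator).
static int dump_epi32(const char *label, __m128i v, char *out, size_t cap) {
    int32_t lanes[4];
    _mm_storeu_si128((__m128i *)lanes, v);
    int len = snprintf(out, cap, "%-8s| %6d | %6d | %6d | %6d |", label,
                       (int)lanes[0], (int)lanes[1], (int)lanes[2], (int)lanes[3]);
    assert(len >= 0 && (size_t)len < cap);  // the row must not be truncated
    return len;
}

// Logged variant of _mm_add_epi32: same result, plus a three-row trace on stdout.
static __m128i traced_add_epi32(__m128i a, __m128i b) {
    __m128i r = _mm_add_epi32(a, b);
    char row[80];
    dump_epi32("a", a, row, sizeof row);
    puts(row);
    dump_epi32("b", b, row, sizeof row);
    puts(row);
    dump_epi32("a + b", r, row, sizeof row);
    puts(row);
    return r;
}
```

In a "pass-two" build the traced call would be replaced by the bare `_mm_add_epi32`, so none of the helper code remains.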
For this line of work, i need some simple graphical output - maybe drawing register-boxes with Markdeep - i'll see. So you see - it's more geared towards an educational-thing with some emphasis on code-readability, so _mm256_i64gather_epi32() is a no-no, maybe gather[M256i, int32]()
AVX512 seems to be a heavy API with lots of new stuff.
As always when one dives into a new field there is so much to sort-out. And keeping it simple is always the right approach. I guess once i've got the helpers running, i've memorized most of these intrinsic/enigmatic terms anyway.
So in pass-two its about speed and then - ideally- nothing of the helpers-code should remain in the build..
cheers, Andreas