nfrechette/rtm

vector_neg by mask

Closed this issue · 5 comments

niello commented

It is more a question than a suggestion. Sometimes I need to negate only certain components of the vector. Now this can be achieved with multiplication:

constexpr rtm::vector4f SignMask{ -1.f, 1.f, -1.f, 1.f };
const rtm::vector4f Result = rtm::vector_mul(Input, SignMask);

But I wonder if using a mask and XOR is a more optimal way. vector_neg prefers this over multiplication for some reason. If this is really better, can you please add a vector_neg variation with a user-defined mask?

In terms of codegen, the optimal code depends on the architecture.
Arm NEON has an intrinsic and instruction that negates each element. This is a single instruction with only a single dependency on the input. There is also a variant for 2D negation and 1D scalar negation. When only certain components need negation, scalar negation often wins with ARM (see quat_conjugate). This leverages pipelining of the negate instructions that can happen in parallel to some extent. That means that the surrounding code determines the optimal instructions to use (see quat_mul for another example).

With SSE2 on x86 and x64, XOR is always faster than FMUL. It tends to be 1 cycle or less, which is hard to beat. Both instructions depend on the input and on the constant, which can be loaded directly from memory in a single instruction. The only case where FMUL might be as fast or faster is if you can avoid the constant load (e.g. the sign is already available in +-1.0f form through some other logic that generates it).
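To make the two options concrete, here is a minimal scalar sketch of what each one does per component. It does not use RTM or SIMD intrinsics: `float4`, `xor_sign`, `negate_xz_xor` and `negate_xz_mul` are hypothetical names chosen for illustration, and the code assumes a 32-bit IEEE-754 `float`.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Plain stand-in for a 4-wide float vector (hypothetical, not an RTM type).
struct float4 { float x, y, z, w; };

// XORs the bit pattern of `value` with `sign_mask`.
// A mask of 0x80000000 flips the sign bit, a mask of 0 leaves the value untouched.
inline float xor_sign(float value, std::uint32_t sign_mask)
{
    static_assert(sizeof(float) == sizeof(std::uint32_t), "this sketch expects a 32-bit float");
    assert(sign_mask == 0u || sign_mask == 0x80000000u);

    std::uint32_t bits;
    std::memcpy(&bits, &value, sizeof(bits));   // well-defined way to read the bit pattern
    bits ^= sign_mask;
    std::memcpy(&value, &bits, sizeof(value));
    return value;
}

// Negates x and z by XOR with a constant sign mask.
inline float4 negate_xz_xor(float4 v)
{
    constexpr std::uint32_t flip = 0x80000000u;
    constexpr std::uint32_t keep = 0u;
    return float4{ xor_sign(v.x, flip), xor_sign(v.y, keep), xor_sign(v.z, flip), xor_sign(v.w, keep) };
}

// Negates x and z by multiplying with a constant of +-1.0f.
inline float4 negate_xz_mul(float4 v)
{
    return float4{ v.x * -1.0f, v.y * 1.0f, v.z * -1.0f, v.w * 1.0f };
}
```

Both functions produce the same values for ordinary finite inputs. The difference discussed above is only in which instruction a SIMD implementation would issue for the whole vector.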

In practice, if you want to keep the code clean and consistent, you can stick to float multiplication. XOR is the faster choice, on Arm as well (compared to FMUL), but both are entirely acceptable and portable between the two architectures.

Things become a bit more tricky if you need to build a selection mask first (to use vector_select). With a dynamic mask, you can't use scalar negation with ARM, you'll have to use XOR/FMUL. That can be done entirely with the help of constants and vector_select and vector_xor or vector_mul like you have now (more or less).
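As a rough illustration of the dynamic-mask case, the sketch below builds a per-lane mask at runtime and then flips the sign bit only where the mask is set. It is plain scalar C++ rather than RTM code: `lanes4`, `mask4`, `less_than_zero` and `negate_where` are hypothetical names, and a 32-bit IEEE-754 `float` is assumed.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical stand-ins for a 4-wide float vector and a 4-wide lane mask.
struct lanes4 { float v[4]; };
struct mask4 { std::uint32_t m[4]; };   // each lane is all ones or all zeros

// Produces a lane mask that is set where the input is negative,
// similar in spirit to what a SIMD comparison returns.
inline mask4 less_than_zero(const lanes4& a)
{
    mask4 result;
    for (int i = 0; i < 4; ++i)
        result.m[i] = (a.v[i] < 0.0f) ? 0xFFFFFFFFu : 0u;
    return result;
}

// Negates only the lanes selected by `mask`:
// AND the mask with the sign bit, then XOR the result into the input.
inline lanes4 negate_where(const lanes4& input, const mask4& mask)
{
    static_assert(sizeof(float) == sizeof(std::uint32_t), "this sketch expects a 32-bit float");

    lanes4 result;
    for (int i = 0; i < 4; ++i)
    {
        assert(mask.m[i] == 0u || mask.m[i] == 0xFFFFFFFFu);

        std::uint32_t bits;
        std::memcpy(&bits, &input.v[i], sizeof(bits));
        bits ^= (mask.m[i] & 0x80000000u);
        std::memcpy(&result.v[i], &bits, sizeof(bits));
    }
    return result;
}
```

For example, `negate_where(input, less_than_zero(input))` negates exactly the negative lanes, which yields the absolute value of each component.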

Here, RTM can't really do a much better job than you can if the mask is dynamic and not known ahead of time. For static masks, we might need many variants like vector_neg_x, vector_neg_xy, vector_neg_xz, etc. Or some variant that takes in 4 template arguments for negate/don't negate (sorta like vector_mix). Are you suggesting the latter?
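The variant with 4 template arguments could look roughly like the sketch below. This is not RTM's API: `vec4` and `vector_neg_mask` are hypothetical names, and the body is scalar code where a real implementation would pick the best instruction per architecture. It sticks to C++11 so that no `if constexpr` is needed.

```cpp
#include <cassert>

// Plain stand-in for a 4-wide float vector (hypothetical, not an RTM type).
struct vec4 { float x, y, z, w; };

// Negates each component whose template flag is true.
// The flags are compile-time constants, so each branch below folds away
// and the compiler is free to pick a single XOR, a per-lane negate, etc.
template <bool neg_x, bool neg_y, bool neg_z, bool neg_w>
inline vec4 vector_neg_mask(vec4 v)
{
    return vec4{
        neg_x ? -v.x : v.x,
        neg_y ? -v.y : v.y,
        neg_z ? -v.z : v.z,
        neg_w ? -v.w : v.w
    };
}

// Usage: negate x and z, keep y and w.
inline void usage_example()
{
    const vec4 r = vector_neg_mask<true, false, true, false>(vec4{ 1.0f, 2.0f, -3.0f, 4.0f });
    assert(r.x == -1.0f && r.y == 2.0f && r.z == 3.0f && r.w == 4.0f);
}
```

A single template avoids one named function per mask (vector_neg_x, vector_neg_xy, ...) while still making the mask known at compile time.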

niello commented

Thanks for such a detailed answer. My initial concern was about multiplication, and XOR is what I was looking for, but I didn't notice that RTM provides it as vector_xor. Now that we are talking about these subtle performance differences, though, I think a vector_neg<...> with per-component arguments could be useful in performance-critical code. In my case the negation masks are known in advance.

As for vector_neg_x, vector_neg_xy etc., they are convenient, but I noticed that you dropped vector_mix_xyab etc. If that was for a reason, then negation should not have these shortcuts either.

BTW, I solved this inconvenience for myself in the following manner so anyone can do the same for negations:

#define RTM_MIX_ALIAS_ONE(c1, c2, c3, c4) \
RTM_DISABLE_SECURITY_COOKIE_CHECK RTM_FORCE_INLINE rtm::vector4f RTM_SIMD_CALL vector_mix_##c1##c2##c3##c4(rtm::vector4f_arg0 input) { return rtm::vector_mix<rtm::mix4::c1, rtm::mix4::c2, rtm::mix4::c3, rtm::mix4::c4>(input, input); }
#define RTM_MIX_ALIAS_TWO(c1, c2, c3, c4) \
RTM_DISABLE_SECURITY_COOKIE_CHECK RTM_FORCE_INLINE rtm::vector4f RTM_SIMD_CALL vector_mix_##c1##c2##c3##c4(rtm::vector4f_arg0 input0, rtm::vector4f_arg1 input1) { return rtm::vector_mix<rtm::mix4::c1, rtm::mix4::c2, rtm::mix4::c3, rtm::mix4::c4>(input0, input1); }

RTM_MIX_ALIAS_ONE(x, y, x, y);
RTM_MIX_ALIAS_TWO(x, y, z, a);

vector_neg<..> seems reasonable and indeed useful. I'll see if I can add that in the next release. I'm not sure what I might call it though. I imagine that using the same name may not please the compiler (but I'll try anyway just to see). Suggestions welcome :)

Thank you for the snippet! The main reason why I haven't introduced a variant of the form vector_mix_xyxy is that all solutions I've found for it aren't all that great IMO:

  • We can use macros but that makes discovering the API very hard as the function definitions end up somewhat hidden and things like intellisense comments may or may not show up properly in IDEs
  • We can use a python (or other) script to generate a header with all permutations, lowering the maintenance burden and streamlining intellisense/IDE support somewhat
  • There is a very large number of permutations and supporting all of them means that a large number of functions must be defined which could significantly slow down compilation in every cpp that includes it. Because RTM is a very low level and core library, it is likely included in a lot of files and any changes to compilation performance can be disproportionate.

HLSL supports it at the compiler level which avoids these issues and replicating the same syntactic sugar in C++ comes with these costs. In particular, there are many things beyond just vector_mix where I'd personally like to be able to use the same trick to improve code readability but in doing so, the compilation cost would likely become prohibitive. And so, for me, it becomes an all or nothing kind of choice in order to keep the API consistent throughout the library (where possible). Separate headers could be used for this, exposing this functionality only piecewise and giving the choice to users whether or not to pay the price for it. For that reason, it remains on my backlog (although perhaps not here on github, I keep a separate todo list on trello) and something I think about from time to time. Feedback much welcome on that front.

The compilation cost is also not to be underestimated... I had to split the vector_mix unit tests into multiple cpp files for that very reason. It allows parallel compilation. Without it, a single cpp file that generates all the permutations with macros takes far too long to compile. A good first step might be to see if generating the permutations with a python script instead might speed up compilation (by avoiding the pre-processor). If that proves viable, then the solution might be workable. Otherwise I doubt that people will want to pay a multi-minute compilation cost in every cpp that does math :(

I spun off #191 to track that portion, I think we could make it a C++20 feature and it might be viable as a module, we'll see. No ETA though.