Try NEON dot product for 8-32 widening.

Question

Closed this issue 2 months ago · 1 comments

The current NEON code uses vmovl_u8() + vmovl_u16() to widen an 8-bit input to 32 bits. Using vdotq_s32() can widen in one step, but it needs some constant values in registers. It should be faster if we can afford the register pressure ...
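
For reference, the two-step widening described above looks roughly like the sketch below. This is only an illustration, not the project's actual code: the function name `widen8_u8_to_u32` is made up, and the scalar branch exists only so the snippet also builds on non-NEON targets.

```c
#include <stdint.h>

#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Hypothetical helper: widen 8 unsigned bytes to 8 unsigned 32-bit values. */
void widen8_u8_to_u32(const uint8_t *src, uint32_t *dst)
{
#if defined(__ARM_NEON)
    uint8x8_t  v8  = vld1_u8(src);                    /* load 8 x u8       */
    uint16x8_t v16 = vmovl_u8(v8);                    /* 8 x u8 -> 8 x u16 */
    uint32x4_t lo  = vmovl_u16(vget_low_u16(v16));    /* low 4  -> 4 x u32 */
    uint32x4_t hi  = vmovl_u16(vget_high_u16(v16));   /* high 4 -> 4 x u32 */
    vst1q_u32(dst, lo);
    vst1q_u32(dst + 4, hi);
#else
    /* Portable fallback with the same result, for non-NEON builds. */
    for (int i = 0; i < 8; i++) {
        dst[i] = (uint32_t)src[i];
    }
#endif
}
```

Each group of 8 inputs costs one vmovl_u8() and two vmovl_u16() calls, which is the chain the dot product idea would try to shorten.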

Answer 1 · 2024-08-09T07:38:09.000Z

It can widen, but because it sums across 4 values it only does so as part of another operation, such as a sum of absolute differences. It would be great for an integer codec implementation, but it's not much use for us at the moment because we only want the widening and not the cross-lane summation.
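
To make the summation concrete, here is a small scalar model of what the dot product instruction computes per 32-bit lane. It is a sketch under assumptions: it models the unsigned variant vdotq_u32() rather than vdotq_s32(), and the function name `dot_u8_to_u32` is hypothetical.

```c
#include <stdint.h>

/* Scalar model of vdotq_u32(acc, a, b): every 32-bit output lane is the
   accumulator plus the sum of four 8-bit x 8-bit products. */
void dot_u8_to_u32(const uint32_t acc[4], const uint8_t a[16],
                   const uint8_t b[16], uint32_t out[4])
{
    for (int lane = 0; lane < 4; lane++) {
        uint32_t sum = acc[lane];
        for (int k = 0; k < 4; k++) {
            /* Four adjacent bytes collapse into one 32-bit lane. */
            sum += (uint32_t)a[4 * lane + k] * (uint32_t)b[4 * lane + k];
        }
        out[lane] = sum;
    }
}
```

With `b` set to all ones and a zero accumulator, lane 0 holds `a[0] + a[1] + a[2] + a[3]`, so 16 inputs produce only 4 outputs. A one-hot constant such as `{1, 0, 0, 0, ...}` would instead pick out every fourth byte, so the lanes come out strided rather than contiguous.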