ARM-software/astc-encoder

Try NEON dot product for 8-32 widening.

Closed · 1 comment

The current NEON code uses vmovl_u8() + vmovl_u16() to widen an 8-bit input to 32 bits. Using vdotq_s32() can widen in one step, but needs some constant values in registers. Should be faster if we can afford the register pressure ...
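For reference, a minimal sketch of the two-step widening described above. The helper name `widen_u8_to_u32_x8` is hypothetical and not taken from the astc-encoder source; the scalar fallback is only there so the snippet also builds on non-Arm hosts.

```c
#include <assert.h>
#include <stdint.h>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Hypothetical helper (not from astc-encoder): widen 8 unsigned bytes
 * to 8 unsigned 32-bit values. */
void widen_u8_to_u32_x8(const uint8_t src[8], uint32_t dst[8])
{
    assert(src && dst);
#if defined(__ARM_NEON)
    /* Step 1: 8 -> 16 bits for all eight lanes. */
    uint16x8_t w16 = vmovl_u8(vld1_u8(src));
    /* Step 2: 16 -> 32 bits, one half of the vector at a time. */
    vst1q_u32(dst,     vmovl_u16(vget_low_u16(w16)));
    vst1q_u32(dst + 4, vmovl_u16(vget_high_u16(w16)));
#else
    /* Scalar fallback with the same result, for non-NEON builds. */
    for (int i = 0; i < 8; i++) {
        dst[i] = src[i];
    }
#endif
}
```

Each input byte ends up in its own 32-bit lane, zero-extended and in the original order, which is the behavior a one-step replacement would need to match.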

It can widen, but because it sums across 4 values it only does so as part of another operation, such as a sum of absolute differences. It would be great for an integer codec implementation, but it's not much use for us at the moment because we only want the widening and not the cross-lane summation.
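To make the summation concrete, here is a scalar sketch of what a 4-way byte dot product computes. It is modelled on the unsigned form (`vdotq_u32`), which is an assumption made because the widening input is unsigned; the name `dot4_model` is hypothetical and this is a model of the arithmetic, not the NEON code itself.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical scalar model of a 4-way unsigned byte dot product:
 * each 32-bit output lane is its accumulator plus the sum of four
 * byte-by-byte products taken from the matching group of four inputs. */
void dot4_model(const uint32_t acc[4], const uint8_t a[16],
                const uint8_t b[16], uint32_t out[4])
{
    assert(acc && a && b && out);
    for (int lane = 0; lane < 4; lane++) {
        uint32_t sum = acc[lane];
        for (int j = 0; j < 4; j++) {
            sum += (uint32_t)a[4 * lane + j] * (uint32_t)b[4 * lane + j];
        }
        out[lane] = sum;
    }
}
```

With a constant `b` of all ones, each output lane holds the sum of four neighbouring input bytes: the values are 32 bits wide, but sixteen inputs have been folded into four outputs rather than widened one-to-one.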