Try NEON dot product for 8-32 widening.
Closed this issue · 1 comment
solidpixel commented
The current NEON code uses `vmovl_u8()` + `vmovl_u16()` to widen an 8-bit input to 32 bits. Using `vdotq_s32()` can widen in one step, but needs some constant values in registers. It should be faster if we can afford the register pressure ...
solidpixel commented
It can widen, but because it sums across each group of 4 values it only does so as part of another operation, such as a sum of absolute differences. It would be great for an integer codec implementation, but it's not much use for us at the moment because we only want the widening and not the cross-lane summation.
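For anyone landing here later, a rough scalar sketch of why the dot product does not give a plain widen. This is portable C that models the per-lane arithmetic, not the project's code and not real intrinsics; the names `model_dot4` and `model_widen_low4` are made up for illustration, and it assumes the unsigned form of the operation.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical scalar model of a 4-way dot-product accumulate over one
   128-bit register: 32-bit lane i receives
   acc[i] + a[4i]*b[4i] + a[4i+1]*b[4i+1] + a[4i+2]*b[4i+2] + a[4i+3]*b[4i+3]. */
void model_dot4(uint32_t acc[4], const uint8_t a[16], const uint8_t b[16])
{
    for (size_t lane = 0; lane < 4; lane++) {
        uint32_t sum = 0;
        for (size_t j = 0; j < 4; j++) {
            sum += (uint32_t)a[4 * lane + j] * (uint32_t)b[4 * lane + j];
        }
        acc[lane] += sum;
    }
}

/* Hypothetical scalar model of the two-step widen (8 -> 16 -> 32 bits)
   applied to the first four bytes: one output lane per input byte. */
void model_widen_low4(uint32_t out[4], const uint8_t a[16])
{
    for (size_t i = 0; i < 4; i++) {
        uint16_t mid = (uint16_t)a[i];  /* 8 -> 16 bits */
        out[i] = (uint32_t)mid;         /* 16 -> 32 bits */
    }
}
```

With a constant of all ones, each output lane holds the sum of four adjacent bytes rather than four separately widened values. With a constant such as `{1,0,0,0, 1,0,0,0, ...}`, each lane holds only the first byte of its group, so covering all 16 bytes would take several constants and several operations, and the results come out strided rather than in order.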