google/XNNPACK

4x16s4 fp32-gemm kernel has better performance than the default (5x16) kernel on Meteor Lake

xujuntwt95329 opened this issue · 1 comment

XNNPACK uses the 5x16 fp32-gemm kernel by default for x86_fma3, but we found that the 4x16s4 kernel performs better on a Meteor Lake CPU (Intel(R) Core(TM) Ultra 7 155H).

| Benchmark | 5x16 (us) | 4x16s4 (us) | Reduction in inference time (%) |
| --- | --- | --- | --- |
| FP32MobileNetV1/T:1/real_time | 16193 | 10775 | 33.46 |
| FP32MobileNetV2/T:1/real_time | 8809 | 6626 | 24.78 |
| FP32MobileNetV3Large/T:1/real_time | 7756 | 6052 | 21.97 |
| FP32MobileNetV3Small/T:1/real_time | 2180 | 1970 | 9.63 |
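For reference, the reduction column can be recomputed from the two timing columns. The snippet below is a minimal sketch; the helper name `reduction_percent` is made up for illustration and is not part of XNNPACK or its benchmark tooling.

```python
# Timings in microseconds, copied from the table above:
# benchmark name -> (5x16 time, 4x16s4 time).
timings_us = {
    "FP32MobileNetV1/T:1/real_time": (16193, 10775),
    "FP32MobileNetV2/T:1/real_time": (8809, 6626),
    "FP32MobileNetV3Large/T:1/real_time": (7756, 6052),
    "FP32MobileNetV3Small/T:1/real_time": (2180, 1970),
}


def reduction_percent(baseline_us, candidate_us):
    """Relative reduction in inference time, in percent of the baseline."""
    return (baseline_us - candidate_us) / baseline_us * 100.0


for name, (t_5x16, t_4x16s4) in timings_us.items():
    print(f"{name}: {reduction_percent(t_5x16, t_4x16s4):.2f}%")
```

The reduction is expressed relative to the 5x16 time, so 33.46% for MobileNetV1 means the 4x16s4 run takes about two thirds of the baseline time.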

Here is the code to reproduce the above data: https://github.com/xujuntwt95329/XNNPACK/tree/0143aab98634c866b319decca52590e1eb54b9dd

We can submit a PR if this is welcome.

Note that this is due to register spilling in the code generated by Visual C. Clang produces better code for the 5x16 kernel.
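A rough register count suggests why a 5x16 tile is more sensitive to the compiler's register allocation than a 4x16 one. This is a simplified model, not a measurement of what Visual C or clang actually emit: it assumes 16 YMM registers with 8 floats each, and (as a worst case) one live register per row of A. The function name `live_registers` and the per-row A counts are assumptions made for illustration.

```python
YMM_REGISTERS = 16   # architectural YMM registers on x86-64 with AVX/AVX2
FLOATS_PER_YMM = 8   # 256 bits / 32-bit float


def live_registers(mr, nr, a_regs):
    """Rough count of vector registers live in the inner loop of an MRxNR tile."""
    accumulators = mr * nr // FLOATS_PER_YMM  # one accumulator per 8 outputs
    b_regs = nr // FLOATS_PER_YMM             # one row of the B panel
    return accumulators + b_regs + a_regs


# Assumption: all five broadcast A values are kept live at the same time.
print("5x16  :", live_registers(5, 16, a_regs=5), "of", YMM_REGISTERS)
# Assumption: the s4 variant keeps one A register per row.
print("4x16s4:", live_registers(4, 16, a_regs=4), "of", YMM_REGISTERS)
```

Under these assumptions the 5x16 tile needs 17 registers unless the compiler reuses A registers, while the 4x16s4 tile needs 14, so the 5x16 kernel only fits when the compiler schedules the A loads carefully.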