4x16s4 fp32-gemm kernel has better performance than the default (5x16) kernel on Meteor Lake
xujuntwt95329 opened this issue · 1 comments
xujuntwt95329 commented
XNNPACK by default uses the 5x16 fp32-gemm kernel for x86_fma3, but we found that the 4x16s4 kernel performs better on a Meteor Lake CPU (Intel(R) Core(TM) Ultra 7 155H).
Benchmark | 5x16 (µs) | 4x16s4 (µs) | Reduction in inference time (%) |
---|---|---|---|
FP32MobileNetV1/T:1/real_time | 16193 | 10775 | 33.46 |
FP32MobileNetV2/T:1/real_time | 8809 | 6626 | 24.78 |
FP32MobileNetV3Large/T:1/real_time | 7756 | 6052 | 21.97 |
FP32MobileNetV3Small/T:1/real_time | 2180 | 1970 | 9.63 |
Here is the code to reproduce the above data: https://github.com/xujuntwt95329/XNNPACK/tree/0143aab98634c866b319decca52590e1eb54b9dd
We can submit a PR if this is welcome.
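For reference, the reduction column in the table is the relative drop in real time against the 5x16 baseline. Below is a minimal sketch of that arithmetic, assuming both columns are in microseconds; the function name `reduction_percent` and the `results` list are illustrative and are not part of XNNPACK or its benchmark tooling.

```python
def reduction_percent(baseline_us: float, candidate_us: float) -> float:
    """Percentage reduction in time relative to the baseline kernel."""
    return (baseline_us - candidate_us) / baseline_us * 100.0


# (benchmark, 5x16 time, 4x16s4 time), values copied from the table.
# The list itself is a hypothetical helper structure for this sketch.
results = [
    ("FP32MobileNetV1/T:1/real_time", 16193, 10775),
    ("FP32MobileNetV2/T:1/real_time", 8809, 6626),
    ("FP32MobileNetV3Large/T:1/real_time", 7756, 6052),
    ("FP32MobileNetV3Small/T:1/real_time", 2180, 1970),
]

for name, t_5x16, t_4x16s4 in results:
    reduction = reduction_percent(t_5x16, t_4x16s4)
    speedup = t_5x16 / t_4x16s4  # same comparison expressed as a ratio
    print(f"{name}: {reduction:.2f}% less time, {speedup:.2f}x speedup")
```

The speedup ratio is printed alongside the percentage because a reduction in time and a speedup factor are easy to confuse when comparing kernels.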
fbarchard commented
Note that this is due to Visual C register spilling. Clang produces better code with 5x16.