NEGEMMLowpMatrixMultiplyCore: fusing GEMMLowpOutputStageInfo to speed up inference
eshoguli opened this issue · 1 comment
Hi guys, I'm extremely interested in speeding up int8 MatMul inference with an ARM Compute Library kernel. My model is:
```mermaid
graph TD;
    Input1["Input<br>out: fp32"];
    Quantise1["NEQuantizationLayer<br>out: signed int8"];
    Input2["Input<br>out: fp32"];
    Quantise2["NEQuantizationLayer<br>out: signed int8"];
    MatMul["NEGEMMLowpMatrixMultiplyCore<br>out: signed int8"];
    Input1-->Quantise1;
    Input2-->Quantise2;
    Quantise1-->MatMul;
    Quantise2-->MatMul;
    MatMul-->Result;
```
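For context, here is a minimal sketch of how I set up the quantisation part of this graph with the library. The shapes and quantisation parameters are placeholders, and I use QASYMM8 here only to stay close to the neon_gemm_qasymm8.cpp example, even though my model uses signed int8:

```cpp
#include "arm_compute/runtime/NEON/NEFunctions.h"
#include "arm_compute/runtime/Tensor.h"

using namespace arm_compute;

const size_t M = 4, N = 4, K = 4;   // placeholder shapes

Tensor src1, src2;                  // fp32 inputs
Tensor q_src1, q_src2;              // quantised inputs
src1.allocator()->init(TensorInfo(TensorShape(K, M), 1, DataType::F32));
src2.allocator()->init(TensorInfo(TensorShape(N, K), 1, DataType::F32));
// The QuantizationInfo of each output tensor defines the scale/offset used by
// NEQuantizationLayer (placeholder values here).
q_src1.allocator()->init(TensorInfo(TensorShape(K, M), 1, DataType::QASYMM8, QuantizationInfo(0.05f, 10)));
q_src2.allocator()->init(TensorInfo(TensorShape(N, K), 1, DataType::QASYMM8, QuantizationInfo(0.05f, 10)));

NEQuantizationLayer q1, q2;
q1.configure(&src1, &q_src1);
q2.configure(&src2, &q_src2);
```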
To make this possible, I would like to use NEGEMMLowpMatrixMultiplyCore.
I have explored the examples and found that the most suitable one is https://github.com/ARM-software/ComputeLibrary/blob/main/examples/neon_gemm_qasymm8.cpp. As I understand it, GEMMLowpOutputStageInfo is used to requantise the output tensor. Unfortunately, in that example it is applied as a standalone operation. I didn't find any example showing how to requantise the output tensor inside a single NEGEMMLowpMatrixMultiplyCore kernel in order to avoid additional memory read/write operations.
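For reference, the unfused flow from that example looks roughly like this (paraphrased, names shortened, building on the tensors above; `info` is a GEMMLowpOutputStageInfo built from the scales and offsets, shown further below). The intermediate int32 tensor is exactly the extra read/write I would like to avoid:

```cpp
// Unfused flow: the lowp GEMM writes an int32 accumulator tensor, which a
// separate NEGEMMLowpOutputStage run then requantises to QASYMM8.
Tensor q_res_s32, q_res;
q_res_s32.allocator()->init(TensorInfo(TensorShape(N, M), 1, DataType::S32));
q_res.allocator()->init(TensorInfo(TensorShape(N, M), 1, DataType::QASYMM8, QuantizationInfo(0.1f, 0)));

NEGEMMLowpMatrixMultiplyCore qgemm;
qgemm.configure(&q_src1, &q_src2, nullptr, &q_res_s32);

NEGEMMLowpOutputStage output_stage;
output_stage.configure(&q_res_s32, nullptr, &q_res, info);   // `info`: see below
```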
While reading the NEGEMMLowpMatrixMultiplyCore implementation, I found that this fusion is possible:
```cpp
// Attach the output stage to the GEMM so requantisation happens inside the kernel.
GEMMInfo gemm_info;
gemm_info.set_gemmlowp_output_stage(info);
// The GEMM output tensor is now the quantised type instead of S32.
q_res.allocator()->init(TensorInfo(TensorShape(N, M), 1, DataType::QASYMM8));
// No separate output-stage function call is needed.
qgemm.configure(&q_src1, &q_src2, nullptr, &q_res, gemm_info);
```
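For completeness, here is how I build the `info` object referenced above, following the helper used in the example. The output scale and zero point are placeholders, and this is just my understanding of the fields, so please correct me if anything is wrong:

```cpp
#include "arm_compute/core/utils/quantization/AsymmHelpers.h"

// Requantisation factor: (scale_a * scale_b) / scale_output, as in the example.
const float scale_src1   = q_src1.info()->quantization_info().uniform().scale;
const float scale_src2   = q_src2.info()->quantization_info().uniform().scale;
const float scale_dst    = 0.1f;   // placeholder output scale
const int32_t offset_dst = 0;      // placeholder output zero point

const float result_scale = (scale_src1 * scale_src2) / scale_dst;

// Convert the float rescale factor into a fixed-point multiplier + right shift.
int32_t q_multiplier = 0;
int32_t q_shift      = 0;
quantization::calculate_quantized_multiplier_less_than_one(result_scale, &q_multiplier, &q_shift);

GEMMLowpOutputStageInfo info;
info.type                = GEMMLowpOutputStageType::QUANTIZE_DOWN_FIXEDPOINT;
info.gemmlowp_multiplier = q_multiplier;
info.gemmlowp_shift      = q_shift;
info.gemmlowp_offset     = offset_dst;
info.output_data_type    = DataType::QASYMM8;
```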
I changed a few lines of the neon_gemm_qasymm8.cpp example to get a working version; the commit is eshoguli@e4e38c5. However, I didn't find any details about set_gemmlowp_output_stage in the documentation or the examples. So, could you please quickly review the changes, so I can be absolutely sure that fusing GEMMLowpOutputStageInfo into NEGEMMLowpMatrixMultiplyCore this way is correct?