NEGEMMLowpMatrixMultiplyCore supported data types
zhen-jia opened this issue · 3 comments
Problem description:
I am confused about the data types supported by NEGEMMLowpMatrixMultiplyCore. The example (https://github.com/ARM-software/ComputeLibrary/blob/main/examples/neon_gemm_qasymm8.cpp#L220) uses the input data type QASYMM8 and the output data type S32. But when I read the code, I would expect this check to trigger an error message: https://github.com/ARM-software/ComputeLibrary/blob/main/src/cpu/operators/internal/CpuGemmAssemblyDispatch.cpp#L792
However, I can run the example without seeing the error message, even though the condition (d->data_type() != DataType::QASYMM8) is true when the output is S32. I am confused. Could you help explain which data types NEGEMMLowpMatrixMultiplyCore supports? Thanks!
Hi @zhen-jia
NEGEMMLowpMatrixMultiplyCore is implemented using CpuGemmLowpMatrixMultiplyCore, see details in
https://github.com/ARM-software/ComputeLibrary/blob/main/src/runtime/NEON/functions/NEGEMMLowpMatrixMultiplyCore.cpp#L65
When you call NEGEMMLowpMatrixMultiplyCore::validate() you end up calling CpuGemmLowpMatrixMultiplyCore::validate(), which supports S32, as can be seen in https://github.com/ARM-software/ComputeLibrary/blob/main/src/cpu/operators/CpuGemmLowpMatrixMultiplyCore.cpp#L313
You can see the data types accepted by CpuGemmLowpMatrixMultiplyCore in
https://github.com/ARM-software/ComputeLibrary/blob/main/src/cpu/operators/CpuGemmLowpMatrixMultiplyCore.h#L78
/** Initialise the kernel's inputs, output
*
* Valid data layouts:
* - NHWC
* - NCHW
*
* Valid data type configurations:
* |src0 |src1 |src2 |dst |
* |:--------------|:------------------|:--------|:--------------|
* |QASYMM8 |QASYMM8 |S32 |QASYMM8 |
* |QASYMM8 |QSYMM8_PER_CHANNEL |S32 |QASYMM8 |
* |QASYMM8 |QSYMM8 |S32 |QASYMM8 |
* |QASYMM8 |QASYMM8 |S32 |S32 |
* |QASYMM8 |QSYMM8_PER_CHANNEL |S32 |S32 |
* |QASYMM8 |QSYMM8 |S32 |S32 |
* |QASYMM8_SIGNED |QASYMM8_SIGNED |S32 |QASYMM8_SIGNED |
* |QASYMM8_SIGNED |QSYMM8_PER_CHANNEL |S32 |QASYMM8_SIGNED |
* |QASYMM8_SIGNED |QSYMM8 |S32 |QASYMM8_SIGNED |
* |QASYMM8_SIGNED |QASYMM8_SIGNED |S32 |S32 |
* |QASYMM8_SIGNED |QSYMM8_PER_CHANNEL |S32 |S32 |
* |QASYMM8_SIGNED |QSYMM8 |S32 |S32 |
*/
CpuGemmAssemblyDispatch is a different class, used internally in ACL to run assembly kernels.
Hope this helps.
Thanks @morgolock for the help. One more question: PyTorch adopts a fused kernel (fusing GEMM and de-quantization into one assembly kernel); they are actually using a dynamic-quantization QNNPACK kernel. I am wondering whether ACL has a kernel like that? If I understand correctly, this folder (https://github.com/ARM-software/ComputeLibrary/tree/main/src/core/NEON/kernels/arm_gemm/kernels) only contains general GEMMs. Correct me if I am wrong. Thanks a lot.
Hi @zhen-jia
You can find the highly optimized GEMM kernels in the folder https://github.com/ARM-software/ComputeLibrary/tree/main/src/core/NEON/kernels/arm_gemm/kernels
The quantization is handled in https://github.com/ARM-software/ComputeLibrary/blob/main/src/core/NEON/kernels/arm_gemm/quantized.cpp#L59
Hope this helps.