ARM-software/ComputeLibrary

NEGEMMLowpMatrixMultiplyCore support type

zhen-jia opened this issue · 3 comments

Problem description:
I am confused by the data type supported for NEGEMMLowpMatrixMultiplyCore. I find that the example (https://github.com/ARM-software/ComputeLibrary/blob/main/examples/neon_gemm_qasymm8.cpp#L220) uses input data type QASYMM8 and output data type S32. But when I read the code, I find here should tiger an error message: https://github.com/ARM-software/ComputeLibrary/blob/main/src/cpu/operators/internal/CpuGemmAssemblyDispatch.cpp#L792
However, I could run the example, without seeing the error message. But the condition (DataType::QASYMM8 && d->data_type() != DataType::QASYMM8) is true. I am confused. Could you help to explain what is the data type supported in NEGEMMLowpMatrixMultiplyCore? Thanks!

Hi @zhen-jia

NEGEMMLowpMatrixMultiplyCore is implemented using CpuGemmLowpMatrixMultiplyCore, see details in
https://github.com/ARM-software/ComputeLibrary/blob/main/src/runtime/NEON/functions/NEGEMMLowpMatrixMultiplyCore.cpp#L65

When you call NEGEMMLowpMatrixMultiplyCore::validate() you end up calling CpuGemmLowpMatrixMultiplyCore::validate() which supports S32 as can be seen in https://github.com/ARM-software/ComputeLibrary/blob/main/src/cpu/operators/CpuGemmLowpMatrixMultiplyCore.cpp#L313
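To illustrate why the check in CpuGemmAssemblyDispatch never rejects the example, here is a simplified sketch of the delegation pattern (hypothetical types and names, not actual ACL code): the public NEON function forwards validation straight to the internal cpu operator, so it is CpuGemmLowpMatrixMultiplyCore's rules that decide what is accepted.

```cpp
#include <cassert>
#include <string>

// Hypothetical sketch of the internal operator's rules (not ACL source):
// S32 is accepted as a destination type alongside the quantized outputs.
struct CpuGemmLowpMatrixMultiplyCoreSketch {
    static bool validate(const std::string &src, const std::string &dst)
    {
        const bool src_ok = (src == "QASYMM8" || src == "QASYMM8_SIGNED");
        const bool dst_ok = (dst == "S32" || dst == src);
        return src_ok && dst_ok;
    }
};

// Hypothetical sketch of the public NEON function: it forwards
// validation to the internal operator and adds nothing of its own.
struct NEGEMMLowpMatrixMultiplyCoreSketch {
    static bool validate(const std::string &src, const std::string &dst)
    {
        return CpuGemmLowpMatrixMultiplyCoreSketch::validate(src, dst);
    }
};
```

Under this sketch, a QASYMM8 input with an S32 output passes validation, which matches what the example does.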

You can see the data types accepted by CpuGemmLowpMatrixMultiplyCore in
https://github.com/ARM-software/ComputeLibrary/blob/main/src/cpu/operators/CpuGemmLowpMatrixMultiplyCore.h#L78

 /** Initialise the kernel's inputs, output
     *
     * Valid data layouts:
     * - NHWC
     * - NCHW
     *
     * Valid data type configurations:
     * |src0           |src1               |src2     |dst            |
     * |:--------------|:------------------|:--------|:--------------|
     * |QASYMM8        |QASYMM8            |S32      |QASYMM8        |
     * |QASYMM8        |QSYMM8_PER_CHANNEL |S32      |QASYMM8        |
     * |QASYMM8        |QSYMM8             |S32      |QASYMM8        |
     * |QASYMM8        |QASYMM8            |S32      |S32            |
     * |QASYMM8        |QSYMM8_PER_CHANNEL |S32      |S32            |
     * |QASYMM8        |QSYMM8             |S32      |S32            |
     * |QASYMM8_SIGNED |QASYMM8_SIGNED     |S32      |QASYMM8_SIGNED |
     * |QASYMM8_SIGNED |QSYMM8_PER_CHANNEL |S32      |QASYMM8_SIGNED |
     * |QASYMM8_SIGNED |QSYMM8             |S32      |QASYMM8_SIGNED |
     * |QASYMM8_SIGNED |QASYMM8_SIGNED     |S32      |S32            |
     * |QASYMM8_SIGNED |QSYMM8_PER_CHANNEL |S32      |S32            |
     * |QASYMM8_SIGNED |QSYMM8             |S32      |S32            |
     */
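The table above can be condensed into a standalone check (a simplified illustration only; the DataType enum here is a stand-in for arm_compute::DataType, and this is not the library's actual validation code). src2, the bias, is always S32, so only src0, src1, and dst vary:

```cpp
#include <cassert>

// Hypothetical stand-in for arm_compute::DataType (illustration only).
enum class DataType { QASYMM8, QASYMM8_SIGNED, QSYMM8, QSYMM8_PER_CHANNEL, S32 };

// Sketch of the valid (src0, src1, dst) combinations from the table above.
// src2 (the bias) is always S32 and is omitted here.
bool is_valid_config(DataType src0, DataType src1, DataType dst)
{
    // Weights may match the input type or be symmetric 8-bit (per-tensor
    // or per-channel).
    const bool weights_ok = (src1 == src0 ||
                             src1 == DataType::QSYMM8 ||
                             src1 == DataType::QSYMM8_PER_CHANNEL);
    if (src0 == DataType::QASYMM8)
        return weights_ok && (dst == DataType::QASYMM8 || dst == DataType::S32);
    if (src0 == DataType::QASYMM8_SIGNED)
        return weights_ok && (dst == DataType::QASYMM8_SIGNED || dst == DataType::S32);
    return false;
}
```

In particular, the QASYMM8-input / S32-output combination from the example is one of the valid rows.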

CpuGemmAssemblyDispatch is a different class used internally in ACL to run assembly kernels.

Hope this helps.

Thanks @morgolock for the help. One more question: PyTorch adopts a fused kernel (fusing GEMM and de-quantization into one assembly kernel); they actually use a dynamic quantization QNNPACK kernel. I am wondering whether ACL has kernels like that. If I understand correctly, this folder (https://github.com/ARM-software/ComputeLibrary/tree/main/src/core/NEON/kernels/arm_gemm/kernels) only contains general GEMMs. Correct me if I am wrong. Thanks a lot.