google/gemmlowp

There is a problem about how to quantize the accumulator(int32) into uint8

sxsxsx opened this issue · 2 comments

Thank you for your contribution. I have a problem with how to quantize the int32 accumulator into uint8.
When I run your quantization_example.cc, I get:
Quantized uint8 LHS matrix:
208 236 0 238
3 214 255 29
Quantized uint8 RHS matrix:
152 51 244
60 26 255
0 127 246
127 254 247
Computing LHS matrix * RHS matrix myself gives:
76002 77196 169718
16979 45468 125195
Quantized uint8 result matrix obtained by quantized multiplication:
168 115 255
0 66 151
In your paper, you say that "The down-scaling corresponds to multiplication by the multiplier M in equation (7)", but how do I quantize
76002 77196 169718
16979 45468 125195
into
168 115 255
0 66 151
The quantized_multiplier is 1200097792 and the right_shift is 7. How should these parameters be used?
Following the paper, M := S1 * S2 / S3 = 0.006603 * 0.007050 / 0.010663 = 0.004366, but 76002 * M != 168.
Could you tell me how to quantize the int32 accumulator into uint8?
Looking forward to your reply, thanks a lot.

What's missing here is the handling of the zero-points. Just multiplying the matrices of uint8 quantized values does not take into account which uint8 value is used to encode the real number 0.0. To take an extreme example, imagine the case where the uint8 value 128 is used to encode the real number 0.0, and imagine that we are quantizing the multiplication of real matrices containing the value 0.0 everywhere. Then the quantized matrices would contain the uint8 value 128 everywhere, and we would compute 128 * 128 + 128 * 128 + ... == 16384 + 16384 + ..., obtaining some large integer values, and yet that would correspond to multiplying 0.0 * 0.0 + 0.0 * 0.0 + ... == 0.0. How does that work?

In itself, the integer computation that is the quantization of our real-numbers matrix multiplication is NOT just the multiplication of the matrices of uint8 values. It is instead the multiplication of these uint8 values AFTER THE ZERO-POINTS have been subtracted from them.
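The extreme example above can be checked with a few lines of Python. This is only a minimal sketch: the function names and the depth of 4 are assumptions chosen for illustration, not anything taken from gemmlowp.

```python
def plain_dot(lhs_row, rhs_col):
    """Plain sum of products of the stored uint8 values."""
    return sum(a * b for a, b in zip(lhs_row, rhs_col))


def zero_point_corrected_dot(lhs_row, rhs_col, lhs_zero_point, rhs_zero_point):
    """Sum of products after subtracting each operand's zero-point."""
    return sum((a - lhs_zero_point) * (b - rhs_zero_point)
               for a, b in zip(lhs_row, rhs_col))


# Hypothetical setup: depth 4, both operands encode 0.0 as the uint8 value 128.
depth = 4
lhs_row = [128] * depth
rhs_col = [128] * depth

raw = plain_dot(lhs_row, rhs_col)
corrected = zero_point_corrected_dot(lhs_row, rhs_col, 128, 128)
print(raw)        # 4 * 128 * 128 = 65536
print(corrected)  # 0, matching 0.0 * 0.0 + ... == 0.0
```

The plain product grows with the depth, while the zero-point-corrected sum stays at 0, which is the integer that actually represents the real result 0.0.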

In our paper (https://arxiv.org/abs/1712.05877), that is equation (4), while the plain multiplication of uint8 matrices is equation (9), and the connection between the two is given by equation (7) and is the subject of Section 2.3.
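As a rough sketch of how these pieces fit together, the Python below uses the matrices, quantized_multiplier, and right_shift quoted in the issue. The zero-points `Z1`, `Z2`, `Z3` are placeholders (the real values are printed by the example program but are not quoted in this thread), so the output is not expected to reproduce the `168 115 255 / 0 66 151` matrix. The function names are made up for illustration, and the rounding is a simplified round-half-up rather than gemmlowp's exact fixed-point arithmetic.

```python
# Matrices quoted in the issue.
LHS = [[208, 236, 0, 238],
       [3, 214, 255, 29]]
RHS = [[152, 51, 244],
       [60, 26, 255],
       [0, 127, 246],
       [127, 254, 247]]


def plain_matmul(lhs, rhs):
    """Plain integer product of the stored uint8 values."""
    depth, cols = len(rhs), len(rhs[0])
    return [[sum(row[j] * rhs[j][k] for j in range(depth)) for k in range(cols)]
            for row in lhs]


def offset_matmul(lhs, rhs, z1, z2):
    """Product of (lhs - z1) and (rhs - z2), computed directly."""
    depth, cols = len(rhs), len(rhs[0])
    return [[sum((row[j] - z1) * (rhs[j][k] - z2) for j in range(depth))
             for k in range(cols)] for row in lhs]


def offset_matmul_from_plain(lhs, rhs, z1, z2):
    """Same result, obtained from the plain product plus correction terms."""
    depth, cols = len(rhs), len(rhs[0])
    plain = plain_matmul(lhs, rhs)
    row_sums = [sum(row) for row in lhs]
    col_sums = [sum(rhs[j][k] for j in range(depth)) for k in range(cols)]
    return [[plain[i][k] - z1 * col_sums[k] - z2 * row_sums[i] + depth * z1 * z2
             for k in range(cols)] for i in range(len(lhs))]


def requantize(acc, quantized_multiplier, right_shift, z3):
    """Scale an int32 accumulator by M, add the output zero-point, clamp to uint8.

    M is approximated as quantized_multiplier * 2**-31 * 2**-right_shift.
    Rounding is simplified (round half up), not gemmlowp's exact scheme.
    """
    total_shift = 31 + right_shift
    scaled = (acc * quantized_multiplier + (1 << (total_shift - 1))) >> total_shift
    return max(0, min(255, scaled + z3))


QUANTIZED_MULTIPLIER = 1200097792  # from the issue
RIGHT_SHIFT = 7                    # from the issue
effective_m = QUANTIZED_MULTIPLIER / 2**31 / 2**RIGHT_SHIFT
print(effective_m)  # close to the 0.004366 computed in the issue

# Placeholder zero-points, chosen only so the sketch runs.
Z1, Z2, Z3 = 100, 120, 50
acc = offset_matmul_from_plain(LHS, RHS, Z1, Z2)
result = [[requantize(a, QUANTIZED_MULTIPLIER, RIGHT_SHIFT, Z3) for a in row]
          for row in acc]
print(result)
```

The point of `offset_matmul_from_plain` is that the plain uint8 product can still be used as the expensive inner kernel; the zero-points only add row sums, column sums, and a constant term on top of it.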

I am trying to implement MobileNet-V1 in OpenCL and I had the same doubt. I still don't understand where the scaling factors (S1, S2, S3) come from. I got the weights from the tflite file, which has quantization values for the input, output, weights, and bias. Thus:
S1 = input
S2 = weights
S3 = output (activation map)
Am I right?
Also, for convolution I am computing SUM(uint8(input) * uint8(weights)) + int32(bias).
So to further quantize the result to uint8, I should multiply it by M (S1 * S2 / S3), right?
The tflite file gives the expression below in the quantization field, e.g.:
quantization: -1 ≤ 0.0078125 * (q - 128) ≤ 0.9921875
So here 0.0078125 is the S value?
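One way to read the expression quoted above is as real = scale * (q - zero_point), which would make 0.0078125 the scale and 128 the zero-point. The sketch below checks that reading against the quoted bounds and computes M from three scales. The helper names are hypothetical and the parser assumes exactly the text layout shown in this comment; the S1, S2, S3 values are the ones from the original question, not MobileNet values. As the earlier reply explains, the zero-points still have to be subtracted before accumulating.

```python
import re


def parse_quantization(text):
    """Extract (scale, zero_point) from 'min ≤ scale * (q - zero_point) ≤ max'."""
    match = re.search(r"([-+0-9.eE]+)\s*\*\s*\(q\s*-\s*(\d+)\)", text)
    if match is None:
        raise ValueError("unrecognized quantization expression: " + text)
    return float(match.group(1)), int(match.group(2))


def dequantize(q, scale, zero_point):
    """Real value represented by the stored uint8 value q."""
    return scale * (q - zero_point)


def output_multiplier(s1, s2, s3):
    """M = S1 * S2 / S3, with S1 = input, S2 = weights, S3 = output scale."""
    return s1 * s2 / s3


scale, zero_point = parse_quantization(
    "quantization: -1 ≤ 0.0078125 * (q - 128) ≤ 0.9921875")
print(scale, zero_point)                  # 0.0078125 128
print(dequantize(0, scale, zero_point))   # -1.0, the quoted lower bound
print(dequantize(255, scale, zero_point)) # 0.9921875, the quoted upper bound

# Scales from the original question in this thread.
m = output_multiplier(0.006603, 0.007050, 0.010663)
print(m)
```

The endpoints q = 0 and q = 255 land exactly on the quoted bounds, which supports reading 0.0078125 as S and 128 as the zero-point for that tensor.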