hustvl/PD-Quant

Why does the quantized model use float weights during the inference stage?

qifeng22 opened this issue · 6 comments

x_quant = torch.clamp(x_int + self.zero_point, 0, self.n_levels - 1)
x_float_q = (x_quant - self.zero_point) * self.delta

@jiawei-liu1103
This code is from class AdaRoundQuantizer(nn.Module).

Hi, when the reconstruction is completed, we set the "soft_targets" of the weight quantizer to False, so "x_int" is the value after quantization. "x_float_q" is indeed a floating-point number, but it is the value after dequantization; this operation is called fake quantization.
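
For illustration, here is a minimal sketch of how an AdaRound-style quantizer behaves once "soft_targets" is set to False. The class name, the sigmoid-based soft rounding, and the parameter initialization are simplified assumptions for this sketch, not PD-Quant's exact code:

```python
import torch
import torch.nn as nn

class AdaRoundQuantizerSketch(nn.Module):
    # Simplified AdaRound-style weight quantizer, for illustration only.
    def __init__(self, weight, delta, zero_point, n_bits=8):
        super().__init__()
        self.delta = delta                  # quantization step size
        self.zero_point = zero_point        # integer offset
        self.n_levels = 2 ** n_bits         # number of representable integer levels
        self.soft_targets = True            # True only while reconstruction is running
        # learnable per-element rounding variable optimized during reconstruction
        self.alpha = nn.Parameter(torch.zeros_like(weight))

    def forward(self, x):
        x_floor = torch.floor(x / self.delta)
        if self.soft_targets:
            # continuous relaxation of the rounding decision
            # (placeholder for AdaRound's rectified sigmoid)
            x_int = x_floor + torch.sigmoid(self.alpha)
        else:
            # after reconstruction: hard 0/1 rounding, so x_int is integer-valued
            x_int = x_floor + (self.alpha >= 0).float()
        # clamp to the valid integer grid -> the quantized value
        x_quant = torch.clamp(x_int + self.zero_point, 0, self.n_levels - 1)
        # dequantize back to float; this float->int->float round trip is fake quantization
        x_float_q = (x_quant - self.zero_point) * self.delta
        return x_float_q
```

With soft_targets = False, x_int is integer-valued (floor plus a hard 0/1 rounding decision), while x_float_q is still a float tensor on the original weight scale, which is why float weights appear at inference time.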

Does the inference accuracy with fake quantization equal that of real on-device (integer) quantization? If so, is there any reference with a theoretical analysis of fake quantization versus real quantization?

The process of fake quantization is "float -> int -> float". The first step, "float -> int", already simulates the loss caused by quantization (truncation error and rounding error). However, if the real hardware only supports integer arithmetic, the results of fake quantization may be slightly better than those of pure integer quantization, because the subsequent computations still run in floating point. You can google the terms "fake quantize" and "dequantize" to understand the difference between fake and real quantization.
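
As a rough sketch of that "float -> int -> float" round trip (the per-tensor delta and zero_point below are made-up values for demonstration, not taken from PD-Quant):

```python
import torch

def fake_quantize(x, delta, zero_point, n_bits=8):
    # float -> int: rounding and clamping introduce the quantization error
    x_int = torch.round(x / delta) + zero_point
    x_quant = torch.clamp(x_int, 0, 2 ** n_bits - 1)
    # int -> float: dequantize so downstream ops still run in floating point
    return (x_quant - zero_point) * delta

w = torch.randn(4, 4)
delta = w.abs().max() / 127        # illustrative step size
zero_point = 128                   # illustrative offset for an unsigned 8-bit grid
w_fq = fake_quantize(w, delta, zero_point)
print((w - w_fq).abs().max())      # the simulated quantization error
```

On integer-only hardware, the dequantize step is folded into the surrounding operations and the matrix multiplications are carried out in integer arithmetic, so the result can differ slightly from this floating-point simulation.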

okok, thank you