jakc4103/DFQ

Compatibility with other models

Closed this issue · 8 comments

Hi, thank you for sharing your work.

I am trying to test this out on the ResNet model.
I am using the pre-trained resnet18 from torchvision. Without any options, the model reaches its reported accuracy as expected, but with equalization and quantization the accuracy drops to 0.1%.

Is this expected behavior due to specific choices in the implementation? If so, would you kindly help me fix it?

Thanks.

That's probably related to the activation value range. I found that the activation value range used for quantization is not accurate enough when estimated with the method in DFQ (approximation using BN statistics).
I've updated the repo to add distilled data from ZeroQ, which is much more stable for calculating the value range.
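For reference, the BN-stats approximation estimates each layer's range from the batch-norm parameters of the preceding layer, roughly like the sketch below (simplified, not the exact code in this repo; the clipping factor `n_std` and the ReLU handling are illustrative assumptions):

```python
import torch
import torch.nn as nn

def estimate_activation_range(bn: nn.BatchNorm2d, n_std: float = 6.0, relu: bool = True):
    """Approximate the post-BN activation range from BN statistics.

    The BN output has per-channel mean beta and std |gamma|, so the range is
    taken as beta +/- n_std * |gamma|. `n_std` and the ReLU clamp are
    illustrative assumptions, not the repo's exact choices.
    """
    mean = bn.bias.detach()          # beta
    std = bn.weight.detach().abs()   # |gamma|

    ch_min = mean - n_std * std
    ch_max = mean + n_std * std

    if relu:
        # A following ReLU clips negative activations to zero.
        ch_min = torch.clamp(ch_min, min=0.0)
        ch_max = torch.clamp(ch_max, min=0.0)

    # Collapse per-channel ranges into a single per-tensor range.
    return ch_min.min().item(), ch_max.max().item()
```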

I tested resnet18 from torchvision and got the following results:
FP32: 69.76%
Int8* ReLU+LE: 0.1%
Int8* ReLU+LE+Distill: 68.84%

Thank you for the prompt reply.
I am still confused as to why the accuracy suffers so much on ResNets, because the paper reports in Table 5 that 69.2% can be reached with just per-layer post-training quantization.

I will look into the implementation and update if anything interesting comes up.

Thanks once again.

Another reason might be that the current implementation does not properly handle pooling layers. I'll look into it.

@bangawayoo
I found the bug in the function set_quant_min_max; it should be OK now for ResNet models.
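For anyone hitting the same thing: the gist is that pooling layers don't expand the activation value range, so the quantization min/max should simply be carried over from the preceding layer. A rough sketch of that idea (purely illustrative, not the actual set_quant_min_max code; the real graph traversal also has to handle residual adds and concats):

```python
import torch.nn as nn

# Layers that do not expand the activation value range, so they can
# inherit the quantization min/max of their input.
RANGE_PRESERVING = (nn.MaxPool2d, nn.AvgPool2d, nn.AdaptiveAvgPool2d, nn.ReLU)

def propagate_ranges(layers, ranges):
    """Fill in missing (min, max) entries for range-preserving layers.

    `layers` is an ordered list of modules and `ranges` maps layer index
    to an already-computed (min, max) pair.
    """
    last = None
    for idx, layer in enumerate(layers):
        if idx in ranges:
            last = ranges[idx]
        elif isinstance(layer, RANGE_PRESERVING) and last is not None:
            ranges[idx] = last  # carry the input range through
    return ranges
```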

Many thanks for your help!

Something still seems to be off. I am having difficulty reproducing the results from the paper.
Below are the results for the pretrained ResNet18 model:

No option (FP32) : 69.758%
--quantize --equalize --relu --correction --clip_weight : 69.218%
--quantize : 69.312%

May I also ask which quantization method is implemented? Per-layer or per-channel?

Thanks.

All quantization is per-tensor. See here.
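That is, a single (scale, zero_point) pair for each weight/activation tensor, rather than one per output channel. Roughly (a simplified sketch of asymmetric per-tensor quantization parameters, not the exact code linked above):

```python
import torch

def per_tensor_qparams(t: torch.Tensor, num_bits: int = 8):
    """One (scale, zero_point) pair for the whole tensor."""
    qmin, qmax = 0, 2 ** num_bits - 1
    t_min = min(t.min().item(), 0.0)  # range must contain zero
    t_max = max(t.max().item(), 0.0)
    scale = max((t_max - t_min) / (qmax - qmin), 1e-8)
    zero_point = int(round(qmin - t_min / scale))
    return scale, zero_point

# Per-channel quantization would instead compute one pair per output channel,
# e.g. from w.flatten(1).min(dim=1) and w.flatten(1).max(dim=1).
```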

I ran resnet18 with
--quantize --relu --equalize: 69.43%
As for bias correction, I think it is reasonable that it does not produce a better result on resnet18 compared to mobilenetv2, since resnet18 is not biased by quantization as much. Also, the correction is only per-channel, not per-channel-per-location.
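For context, the per-channel correction follows the idea from the paper: the weight quantization error shifts the expected output of each channel, and that shift is subtracted from the bias. A rough sketch (not the exact repo code; `expected_input` holding per-input-channel expectations E[x] is an assumption for illustration):

```python
import torch

def correct_bias(w_fp, w_q, bias, expected_input):
    """Per-channel bias correction, simplified.

    w_fp, w_q: float and quantized-dequantized conv weights, (out_ch, in_ch, kh, kw).
    expected_input: assumed per-input-channel expectation E[x], shape (in_ch,).
    """
    eps = w_q - w_fp  # weight quantization error
    # Expected per-channel output shift: sum over in_ch, kh, kw of eps * E[x].
    shift = (eps * expected_input.view(1, -1, 1, 1)).sum(dim=(1, 2, 3))
    return bias - shift  # so E[y_quant] matches E[y_fp32] per output channel
```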

Honestly, I don't think my implementation will 100% reproduce the results from the paper, for three reasons:

  1. I use fake quantization (see the sketch after this list). In the inference phase there can be numerical differences compared to true int8 fixed-point arithmetic.
  2. It involves many implementation choices, e.g. how to compute the expected value range for element-wise add and concat layers.
  3. I have not found a way to make bias absorption (Eq. 15 from the paper) work.
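
By fake quantization I mean a quantize-dequantize round trip done in floating point, roughly like the sketch below; a real int8 kernel multiplies in int8, accumulates in int32, and requantizes, so rounding and saturation can differ slightly:

```python
import torch

def fake_quantize(x: torch.Tensor, scale: float, zero_point: int, num_bits: int = 8):
    """Simulate uint8 quantization in float: quantize, clamp, dequantize."""
    qmin, qmax = 0, 2 ** num_bits - 1
    q = torch.clamp(torch.round(x / scale) + zero_point, qmin, qmax)
    return (q - zero_point) * scale
```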

Anyway, I'll keep looking into it to see if I can improve it further.

Thank you very much for your explanation. I will close this issue as the main problem seems to be resolved.