aegroto/nif

Question about the Quantization process

JordanChua opened this issue · 2 comments

Hi Lorenzo, thanks for the paper, it was a good read! I'm trying to implement a quantization pipeline on a trained model, and I was hoping to refer to the compression pipeline you have implemented in this paper, mainly QAT followed by quantization and entropy coding.

I was hoping to get some of your input on how I could achieve this! Thanks a lot!

Hello, thanks for your interest! Most of the quantization code can be found in https://github.com/aegroto/nif/blob/master/compression/__init__.py. Basically, the original floating point values are normalized and then quantized to 8-bit integers in the range [-128, 127].
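For illustration, here is a minimal sketch of that kind of symmetric normalize-and-quantize step (the names `quantize_tensor`/`dequantize_tensor` and the exact scaling are my own assumptions, not necessarily the code in `compression/__init__.py`):

```python
import numpy as np

def quantize_tensor(values: np.ndarray):
    """Hypothetical sketch: normalize floats by a symmetric scale,
    then round into int8 range [-128, 127]."""
    scale = np.abs(values).max() / 127.0
    q = np.clip(np.round(values / scale), -128, 127).astype(np.int8)
    return q, scale

def dequantize_tensor(q: np.ndarray, scale: float) -> np.ndarray:
    """Map the int8 codes back to approximate float values."""
    return q.astype(np.float32) * scale

weights = np.array([0.5, -1.2, 0.03, 0.99], dtype=np.float32)
q, scale = quantize_tensor(weights)
recovered = dequantize_tensor(q, scale)
```

With this scheme the reconstruction error per weight is bounded by half the scale step, which is why normalizing before quantizing matters.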

This quantization is used in QAT in `simulate_quantization(model, quantization_config)`. This is a very naive implementation of QAT, as the weights are substituted directly with their quantized counterparts. A simple yet effective way to obtain better results is to add the quantization residual as noise before the linear pass, which is the approach I have adopted in more recent code, such as in the fairseq library: https://github.com/facebookresearch/fairseq/blob/34973a94d09ecc12092a5ecc8afece5e536b7692/fairseq/modules/quantization/scalar/modules/qlinear.py#L88
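A rough sketch of that residual-as-noise idea (my own simplified take, not the actual fairseq or nif code): during training, compute the quantized weights, detach the residual so gradients still flow to the full-precision weights, and add it before the linear pass.

```python
import torch
import torch.nn as nn

class NoisyQuantLinear(nn.Linear):
    """Hypothetical sketch of noise-based QAT: instead of replacing the
    weights with their quantized values, add the (detached) quantization
    residual as noise before the linear pass, in the spirit of fairseq's
    scalar-quantized qlinear."""

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if self.training:
            # Symmetric int8-style fake quantization of the weights.
            scale = self.weight.abs().max() / 127.0
            q = torch.clamp(torch.round(self.weight / scale), -128, 127) * scale
            # Detach the residual: the forward pass sees quantized weights,
            # but gradients flow to the full-precision weights.
            noise = (q - self.weight).detach()
            w = self.weight + noise
        else:
            w = self.weight
        return nn.functional.linear(x, w, self.bias)
```

Because the residual is detached, the backward pass behaves like a straight-through estimator while the forward pass matches what the quantized model will actually compute.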

Entropy coding is done by applying brotli to the quantized tensors, cast to numpy int8 arrays:

`compressed = brotli.compress(buffer, lgwin=10)`
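Putting that step together end to end might look like the following sketch (assuming the `brotli` package is installed; the exact buffer handling in the repo may differ):

```python
import brotli
import numpy as np

# Some already-quantized weights as an int8 numpy array.
q = np.random.randint(-128, 128, size=1000).astype(np.int8)

# Serialize the raw int8 bytes and entropy-code them with brotli.
buffer = q.tobytes()
compressed = brotli.compress(buffer, lgwin=10)

# Decoding is the exact inverse: decompress, then reinterpret as int8.
decompressed = np.frombuffer(brotli.decompress(compressed), dtype=np.int8)
```

Since brotli is lossless, the round trip recovers the quantized tensor bit-exactly; all the loss in the pipeline comes from the quantization step itself.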

I hope these tips help you with your research. Feel free to ask if you have any further doubts.

Thanks a lot for the help Lorenzo! As you suggested, I have managed to get the naive implementation of QAT working, and I'm now working on the approach you suggested which takes the quantization noise into account.