quantization issue for weight only int8 with fp16
HDCharles opened this issue · 1 comment
HDCharles commented
bf16 or fp32 works. I tested change_linear_weights_to_int8_woqtensors, change_linear_weights_to_int8_dqtensors, and change_linear_weights_to_int4_woqtensors (they're generally safer than the module-swap methods since they dispatch on the function being used).
All 3 work reasonably well, but int8 weight-only quant doesn't work for fp16: if both the weight and activation are fp16, you overflow the fp16 range before the rescale.
see:
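To illustrate the overflow, here's a minimal sketch (not the torchao kernel; the shapes, per-tensor scale, and values are made up for demonstration): when the int8 weight is upcast to fp16, the reduction runs in fp16, and the scale is only applied afterwards, the pre-rescale accumulator can exceed fp16's max (~65504), whereas the same math with a bf16/fp32 accumulator stays in range.

```python
import torch

# Illustrative symmetric int8 weight-only quant: upcast the int8 weight to the
# activation dtype, reduce in that dtype, then apply the scale afterwards.
k = 512
x = torch.full((k,), 2.0, dtype=torch.float16)   # fp16 activations, magnitude 2
w = torch.full((k,), 0.01, dtype=torch.float16)  # fp16 weights
scale = w.abs().max().float() / 127              # per-tensor scale for simplicity
w_int8 = torch.round(w.float() / scale).to(torch.int8)   # every entry quantizes to 127

# pre-rescale accumulation in fp16: 2 * 127 * 512 = 130048 > 65504 (fp16 max)
acc_fp16 = (x * w_int8.to(torch.float16)).sum()
print(acc_fp16)                                   # inf
print(acc_fp16 * scale.to(torch.float16))         # still inf -- rescaling can't recover it

# the same computation with an fp32 accumulator stays in range
acc_fp32 = (x.float() * w_int8.float()).sum()
print(acc_fp32 * scale)                           # ~10.24
```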