quantization issue for weight only int8 with fp16
HDCharles opened this issue · 1 comment
HDCharles commented
bf16 or fp32 works. I tested change_linear_weights_to_int8_woqtensors, change_linear_weights_to_int8_dqtensors, and change_linear_weights_to_int4_woqtensors (they're generally safer than the module-swap methods since they dispatch on the function being used).
All 3 work reasonably well, but int8 weight-only quant doesn't work for fp16: if both the weight and activation are fp16, you overflow the fp16 range before the rescale.
see:
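To illustrate the overflow, here's a minimal sketch (not the torchao kernel; the shapes, per-tensor scale, and values are made up for demonstration): when the int8 weight is upcast to fp16, the reduction runs in fp16, and the scale is only applied afterwards, the pre-rescale accumulator can exceed fp16's max (~65504), whereas the same math with a bf16/fp32 accumulator stays in range.

```python
import torch

# Illustrative symmetric int8 weight-only quant: upcast the int8 weight to the
# activation dtype, reduce in that dtype, then apply the scale afterwards.
k = 512
x = torch.full((k,), 2.0, dtype=torch.float16)   # fp16 activations, magnitude 2
w = torch.full((k,), 0.01, dtype=torch.float16)  # fp16 weights
scale = w.abs().max().float() / 127              # per-tensor scale for simplicity
w_int8 = torch.round(w.float() / scale).to(torch.int8)   # every entry quantizes to 127

# pre-rescale accumulation in fp16: 2 * 127 * 512 = 130048 > 65504 (fp16 max)
acc_fp16 = (x * w_int8.to(torch.float16)).sum()
print(acc_fp16)                                   # inf
print(acc_fp16 * scale.to(torch.float16))         # still inf -- rescaling can't recover it

# the same computation with an fp32 accumulator stays in range
acc_fp32 = (x.float() * w_int8.float()).sum()
print(acc_fp32 * scale)                           # ~10.24
```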