huggingface/diffusion-fast

quantization issue for weight-only int8 with fp16

HDCharles opened this issue · 1 comment

bf16 or fp32 works. I tested change_linear_weights_to_int8_woqtensors, change_linear_weights_to_int8_dqtensors, and change_linear_weights_to_int4_woqtensors (they're generally safer than the module-swap methods since they dispatch on the function being used).
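
For context, here's a minimal sketch of how these entry points are typically applied to the SDXL UNet. The import path `torchao.quantization.quant_api`, the model id, and the prompt are assumptions for illustration (exact paths may differ across torchao versions); the function names are the ones listed above.

```python
import torch
from diffusers import StableDiffusionXLPipeline
# Assumed import location; the functions are named exactly as in the comment
# above, but where they live may differ across torchao versions.
from torchao.quantization.quant_api import (
    change_linear_weights_to_int8_woqtensors,
    change_linear_weights_to_int8_dqtensors,
    change_linear_weights_to_int4_woqtensors,
)

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.bfloat16,  # bf16 (or fp32) sidesteps the fp16 overflow described below
).to("cuda")

# Replace the UNet's Linear weights with quantized tensor subclasses.
# Apply exactly one of the three; they dispatch on the function being used
# rather than swapping out the module, which is why they compose more safely.
change_linear_weights_to_int8_woqtensors(pipe.unet)
# change_linear_weights_to_int8_dqtensors(pipe.unet)
# change_linear_weights_to_int4_woqtensors(pipe.unet)

image = pipe("a photo of an astronaut riding a horse", num_inference_steps=30).images[0]
```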

All 3 work reasonably well; int8 weight-only quant doesn't work for fp16, though, since if both the weight and the activation are fp16, you overflow the fp16 range before the rescale.
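
To make the overflow concrete, here's a small, contrived illustration (not from the repo; the dimension and scale are made up) of why fp16 accumulation blows up before the rescale while fp32/bf16 does not:

```python
import torch

k = 4096  # inner dimension of a typical Linear layer (illustrative)

# Int8 weight-only quant stores weights as integers in [-127, 127]; the
# per-channel scale that maps them back to the original range is applied
# *after* the matmul.
w_int8 = torch.full((k,), 100, dtype=torch.int8)
x_fp16 = torch.full((k,), 2.0, dtype=torch.float16)

# fp16 accumulation: 4096 * (100 * 2.0) = 819200, far beyond fp16's max
# finite value of 65504, so the pre-rescale result is already inf.
acc_fp16 = (w_int8.to(torch.float16) * x_fp16).sum()

# The same accumulation in fp32 (or bf16, with its 8-bit exponent) stays
# finite, so the rescale afterwards can bring the value back into range.
acc_fp32 = (w_int8.to(torch.float32) * x_fp16.float()).sum()

scale = 1e-3  # hypothetical per-channel dequant scale
print(acc_fp16 * scale)  # inf  -- rescaling inf can't recover the value
print(acc_fp32 * scale)  # ~819.2
```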

see:

https://github.com/sayakpaul/sdxl-fast/pull/2/files

Closing with #2.