Quantization takes a very long time
timohear opened this issue · 3 comments
Using TGI or Lorax, eetq quantization takes several minutes (e.g. 10 minutes for Mixtral) every time the launcher is run.
As a reference, bitsandbytes nf4 quantization takes 1 minute.
Is there any way to store and directly load the eetq model?
And thank you for eetq, I've been wishing for high-speed 8-bit inference for quite some time :-)
@timohear It is straightforward to save and load a model quantized with eetq, like this:
import torch
from eetq.utils import eet_quantize

eet_quantize(torch_model)                  # quantize the weights in place with eetq
torch.save(torch_model, "xxx_eetq.pt")     # persist the quantized module
...
torch_model = torch.load("xxx_eetq.pt")    # load the quantized model back directly
But this has not been wired into TGI yet; your suggestion is very useful. We can add and test a path for loading a pre-quantized eetq model in TGI.
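For reference, a minimal sketch of the same round trip via the state dict instead of the pickled module, assuming eet_quantize replaces the Linear layers in place so the quantized parameter names and shapes line up between save and load (torch_model, fresh_model and the file name are placeholders):

import torch
from safetensors.torch import save_model, load_model
from eetq.utils import eet_quantize

eet_quantize(torch_model)                            # quantize weights in place
save_model(torch_model, "model_eetq.safetensors")    # tensors only, no pickle

# elsewhere: rebuild the quantized module skeleton first, then load the tensors
eet_quantize(fresh_model)
load_model(fresh_model, "model_eetq.safetensors")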
Late to the party as I'm upgrading eetq at the moment. (TGI maintainer here).
We're not going to enable PyTorch pickle loading at all, but saving to safetensors is definitely on the table.
I think all we have to do is save the model as usual and add a quantization_config with quant_method: eetq to the config, and that's it.
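If it helps, a rough sketch of what that could look like on the transformers side, assuming a transformers version that ships EetqConfig (the model id and output directory are just placeholders):

# Rough sketch, not the final TGI flow: quantize once with eetq via transformers,
# then save. save_pretrained writes safetensors shards plus a config.json whose
# "quantization_config" carries quant_method: "eetq", which is what TGI would
# check in order to load prequantized weights instead of re-quantizing at launch.
from transformers import AutoModelForCausalLM, EetqConfig

quant_config = EetqConfig("int8")              # weight-only int8 via eetq
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mixtral-8x7B-Instruct-v0.1",    # placeholder model id
    quantization_config=quant_config,
    device_map="auto",
)
model.save_pretrained("mixtral-eetq")          # prequantized weights + config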