Quantization takes a very long time
timohear opened this issue · 3 comments
Using TGI or Lorax, eetq quantization takes several minutes (e.g. 10 minutes for Mixtral) every time the launcher is run.
As a reference, bitsandbytes nf4 quantization takes 1 minute.
Is there any way to store and directly load the eetq model?
And thank you for eetq, I've been wishing for high-speed 8-bit inference for quite some time :-)
@timohear It is straightforward to save and load a model quantized with eetq, like this:
import torch
from eetq.utils import eet_quantize

eet_quantize(torch_model)                  # quantize the weights in place with eetq
torch.save(torch_model, "xxx_eetq.pt")     # persist the quantized module
...
torch_model = torch.load("xxx_eetq.pt")    # load the quantized model back directly
But this has not been wired into TGI yet; your suggestion is very useful. We can add and test a path for loading a pre-quantized eetq model in TGI.
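For reference, a minimal sketch of the same round trip via the state dict instead of the pickled module, assuming eet_quantize replaces the Linear layers in place so the quantized parameter names and shapes line up between save and load (torch_model, fresh_model and the file name are placeholders):

import torch
from safetensors.torch import save_model, load_model
from eetq.utils import eet_quantize

eet_quantize(torch_model)                            # quantize weights in place
save_model(torch_model, "model_eetq.safetensors")    # tensors only, no pickle

# elsewhere: rebuild the quantized module skeleton first, then load the tensors
eet_quantize(fresh_model)
load_model(fresh_model, "model_eetq.safetensors")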
Late to the party as I'm upgrading eetq at the moment. (TGI maintainer here).
We're not going to enable PyTorch pickle loading at all, but saving to safetensors is definitely on the table.
I think all we have to do is save the model as usual and add a quantization_config with quant_method: eetq to the config, and that's it.
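If it helps, a rough sketch of what that could look like on the transformers side, assuming a transformers version that ships EetqConfig (the model id and output directory are just placeholders):

# Rough sketch, not the final TGI flow: quantize once with eetq via transformers,
# then save. save_pretrained writes safetensors shards plus a config.json whose
# "quantization_config" carries quant_method: "eetq", which is what TGI would
# check in order to load prequantized weights instead of re-quantizing at launch.
from transformers import AutoModelForCausalLM, EetqConfig

quant_config = EetqConfig("int8")              # weight-only int8 via eetq
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mixtral-8x7B-Instruct-v0.1",    # placeholder model id
    quantization_config=quant_config,
    device_map="auto",
)
model.save_pretrained("mixtral-eetq")          # prequantized weights + config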