Some questions
catid opened this issue · 1 comment
(1) Is there a productized version of this software for quantizing real models? I don't see any documentation on how to quantize a model.
(2) Is it incorporated into any fast inference engines so that it can be used? Or is it compatible with vLLM, exllamav2, transformers etc?
(3) How does it compare to SotA like QuIP# or AQLM? Can it be used in addition to these things?
Thank you so much for your questions.
(1) The current repo is mainly developed for research purposes. For now, we only support LLaMA-2 models in our paper and repo.
(2) QuaRot is not yet integrated into any fast inference engine, as it has only been 20 days since we published our code. We are working with some developers to integrate it into other repos.
(3) How does it compare to SotA like QuIP# or AQLM? Can it be used in addition to these things?
QuIP# and AQLM quantize only the weights, while QuaRot targets the whole inference pipeline: it quantizes the weights, activations, and KV cache. We have a dedicated section in the paper with experiments on the weight-only case (see Table 5 in the paper).
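To make the distinction concrete, here is a minimal illustrative sketch (not QuaRot's actual code, and using plain round-to-nearest rather than QuaRot's rotation-based scheme). The function `quantize_sym` and the shapes are hypothetical; the point is only that weight-only methods keep activations in floating point, while a full scheme quantizes both operands of each matmul:

```python
import numpy as np

def quantize_sym(x, bits=4):
    """Symmetric round-to-nearest fake-quantization to `bits` bits (illustration only)."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(x).max() / qmax
    q = np.clip(np.round(x / scale), -qmax - 1, qmax)
    return q * scale  # dequantized ("fake-quantized") values

rng = np.random.default_rng(0)
W = rng.standard_normal((16, 16))   # layer weights
X = rng.standard_normal((4, 16))    # input activations

# Weight-only setting (the QuIP#/AQLM regime): activations stay floating point.
y_weight_only = X @ quantize_sym(W, bits=4).T

# Full setting (the QuaRot regime): weights AND activations are quantized,
# so the matmul itself can run in low-precision arithmetic.
y_full = quantize_sym(X, bits=4) @ quantize_sym(W, bits=4).T
```

In an actual deployment the quantized matmul would use integer kernels rather than dequantize-then-multiply; the sketch only shows which tensors each approach quantizes.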