GPTQ 4-bit quantized version
regstuff opened this issue · 3 comments
regstuff commented
Hi,
Do you guys have any plans to make a GPTQ 4-bit quantized version of your models? That would cut VRAM usage and improve inference speed a lot, without much loss in capability. A lot of other LLaMA/Alpaca models are doing this.
I'd do it myself but I don't have the kind of RAM needed for a conversion.
Thanks for this great model. Please keep going!
alxfoster commented
+1 I'd love to see this too. If helpful, I have access to a sufficiently capable machine (Ubuntu, 28 cores, 168 GB RAM, 132 GB swap, 2x 3090 with NVLink, 48 GB VRAM) and would be willing to provide the compute if anyone can draft a detailed guide or help with setup to quantize the 65B LLaMA in 4-bit / 128g.
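In case it helps whoever picks this up: below is a rough, untested sketch of a 4-bit / 128g quantization using the AutoGPTQ library (not this repo's own tooling). The model path, output directory, and the single calibration sentence are placeholders; a real run would need the actual 65B checkpoint and a proper calibration set (e.g. a few hundred samples from C4 or wikitext2).

```python
# Untested sketch: GPTQ 4-bit, group size 128, via AutoGPTQ.
# Paths and calibration text below are placeholders, not real values.
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

pretrained_model_dir = "path/to/llama-65b-hf"   # placeholder
quantized_model_dir = "llama-65b-4bit-128g"     # placeholder output dir

tokenizer = AutoTokenizer.from_pretrained(pretrained_model_dir, use_fast=True)

# Calibration examples: GPTQ uses their activations to fit the quantized weights.
examples = [tokenizer("This is a placeholder calibration sample.")]

quantize_config = BaseQuantizeConfig(
    bits=4,         # 4-bit weights
    group_size=128, # the 128g setting requested above
)

model = AutoGPTQForCausalLM.from_pretrained(pretrained_model_dir, quantize_config)
model.quantize(examples)                   # runs GPTQ layer by layer
model.save_quantized(quantized_model_dir)  # writes the 4-bit checkpoint
```

The saved checkpoint could then be loaded for inference with `AutoGPTQForCausalLM.from_quantized(quantized_model_dir, device="cuda:0")`.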
lolxdmainkaisemaanlu commented
This would be really helpful!
davidliudev commented
+1. It would be great if this could be done.