project-baize/baize-chatbot

GPTQ 4-bit quantized version

regstuff opened this issue · 3 comments

Hi,
Do you guys have any plans to make a GPTQ 4-bit quantized version of your models? That would cut VRAM usage and improve inference speed a lot, without much loss in capabilities. A lot of other LLaMA/Alpaca models are doing this.
I'd do it myself, but I don't have the kind of RAM needed for the conversion.
Thanks for this great model. Please keep going!

+1, I'd love to see this too. If helpful, I have access to a sufficiently capable machine (Ubuntu, 28 cores, 168 GB RAM, 132 GB swap, NVLink-ed 2x 3090 with 48 GB VRAM) and would be willing to provide the compute if anyone can draft a detailed guide or help with the setup to quantize the 65B LLaMA in 4-bit / 128g.
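For whoever picks this up, here is a rough sketch of what the quantization pass might look like with the AutoGPTQ library. The paths and calibration text are placeholders, and since Baize is released as LoRA weights they would first need to be merged into the base LLaMA checkpoint:

```python
# Minimal 4-bit / group-size-128 GPTQ sketch using AutoGPTQ.
# All paths below are hypothetical; adjust to your merged checkpoint.
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

merged_model_dir = "path/to/merged-baize-65b"      # hypothetical local path to merged weights
quantized_model_dir = "baize-65b-gptq-4bit-128g"   # output directory

tokenizer = AutoTokenizer.from_pretrained(merged_model_dir, use_fast=True)

# GPTQ needs calibration samples; a real run would use a few hundred
# examples drawn from the chat data rather than this single string.
calibration_examples = [
    tokenizer("The conversation between a human and an AI assistant.")
]

quantize_config = BaseQuantizeConfig(
    bits=4,          # 4-bit weights
    group_size=128,  # the "128g" grouping
    desc_act=False,
)

model = AutoGPTQForCausalLM.from_pretrained(merged_model_dir, quantize_config)
model.quantize(calibration_examples)
model.save_quantized(quantized_model_dir, use_safetensors=True)
```

Loading the result afterwards with `AutoGPTQForCausalLM.from_quantized(...)` should let the model run in roughly a quarter of the fp16 weight footprint.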

This would be really helpful!

+1. It would be great if this could be done.