GPTQ 4-bit quantized version
regstuff opened this issue · 3 comments
regstuff commented
Hi,
Do you guys have any plans to make a GPTQ 4-bit quantized version of your models? That would cut VRAM usage and improve inference speed a lot, without much loss in capability. A lot of other LLaMA/Alpaca models are doing this.
I'd do it myself but I don't have the kind of RAM needed for a conversion.
Thanks for this great model. Please keep going!
alxfoster commented
+1 I'd love to see this too. If helpful, I have access to a sufficiently capable machine (Ubuntu, 28 cores, 168 GB RAM, 132 GB swap, 2x 3090 with NVLink, 48 GB VRAM) and would be willing to provide the compute if anyone can draft a detailed guide or help with setup to quantize the 65B LLaMA in 4-bit / 128g.
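In case it helps whoever picks this up: below is a rough, untested sketch of a 4-bit / 128g quantization using the AutoGPTQ library (not this repo's own tooling). The model path, output directory, and the single calibration sentence are placeholders; a real run would need the actual 65B checkpoint and a proper calibration set (e.g. a few hundred samples from C4 or wikitext2).

```python
# Untested sketch: GPTQ 4-bit, group size 128, via AutoGPTQ.
# Paths and calibration text below are placeholders, not real values.
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

pretrained_model_dir = "path/to/llama-65b-hf"   # placeholder
quantized_model_dir = "llama-65b-4bit-128g"     # placeholder output dir

tokenizer = AutoTokenizer.from_pretrained(pretrained_model_dir, use_fast=True)

# Calibration examples: GPTQ uses their activations to fit the quantized weights.
examples = [tokenizer("This is a placeholder calibration sample.")]

quantize_config = BaseQuantizeConfig(
    bits=4,         # 4-bit weights
    group_size=128, # the 128g setting requested above
)

model = AutoGPTQForCausalLM.from_pretrained(pretrained_model_dir, quantize_config)
model.quantize(examples)                   # runs GPTQ layer by layer
model.save_quantized(quantized_model_dir)  # writes the 4-bit checkpoint
```

The saved checkpoint could then be loaded for inference with `AutoGPTQForCausalLM.from_quantized(quantized_model_dir, device="cuda:0")`.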
lolxdmainkaisemaanlu commented
This would be really helpful!
davidliudev commented
+1. It would be great if this could be done.