4-bit quantization of LLaMa using GPTQ. GPTQ is a SOTA one-shot weight quantization method.
This code is based on GPTQ.
| Model (LLaMa-7B) | Bits | group-size | Wikitext2 | PTB | C4 |
|---|---|---|---|---|---|
| FP16 | 16 | - | 5.67 | 8.79 | 7.05 |
| RTN | 4 | - | 6.28 | 9.68 | 7.70 |
| GPTQ | 4 | 1024 | 6.98 | 10.81 | 7.99 |
| GPTQ | 4 | 64 | 6.16 | 9.66 | 7.52 |
| RTN | 3 | - | 25.66 | 61.25 | 28.19 |
| GPTQ | 3 | 64 | 12.24 | 16.77 | 9.55 |
| RTN | 2 | - | 101940 | 123128 | 109331 |
| GPTQ | 2 | 64 | 75.28 | 241.18 | 60.79 |
| Model (LLaMa-13B) | Bits | group-size | Wikitext2 | PTB | C4 |
|---|---|---|---|---|---|
| FP16 | 16 | - | 5.08 | 8.06 | 6.58 |
| RTN | 4 | - | 5.52 | 8.62 | 6.96 |
| RTN | 3 | - | 11.41 | 21.21 | 13.20 |
Quantizing the model requires a large amount of CPU memory. For example, quantizing the LLaMa-13B model requires 32 GB, and LLaMa-33B requires more than 64 GB.
According to "The case for 4-bit precision" and the GPTQ paper, a smaller group size achieves lower perplexity (ppl). Therefore, a group size lower than 128 is recommended.
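To make the group-size trade-off concrete, here is a minimal, illustrative sketch (not the code used in this repo) of group-wise 4-bit round-to-nearest quantization; shrinking the group size gives each group its own scale and zero-point, which reduces quantization error:

```python
import torch

def quantize_rtn_4bit(weight: torch.Tensor, group_size: int):
    """Illustrative asymmetric 4-bit round-to-nearest quantization.

    Each group of `group_size` consecutive input weights shares one
    scale and zero-point, so a smaller group size means finer scales.
    """
    out_f, in_f = weight.shape
    w = weight.reshape(out_f, in_f // group_size, group_size)
    w_min, w_max = w.amin(dim=-1, keepdim=True), w.amax(dim=-1, keepdim=True)
    scale = (w_max - w_min).clamp(min=1e-8) / 15            # 16 levels for 4 bits
    zero = torch.round(-w_min / scale)
    q = torch.clamp(torch.round(w / scale) + zero, 0, 15)   # int4 codes in [0, 15]
    w_hat = (q - zero) * scale                               # dequantized weights
    return q.reshape(out_f, in_f), w_hat.reshape(out_f, in_f)

w = torch.randn(4096, 4096)
for gs in (1024, 128, 64):
    _, w_hat = quantize_rtn_4bit(w, gs)
    print(gs, (w - w_hat).pow(2).mean().item())  # reconstruction error shrinks with smaller groups
```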
Depending on the GPU and driver, there may be a difference in performance; this difference decreases as the model size increases (see IST-DASLab/gptq#1).
According to the GPTQ paper, as the model size increases, the performance gap between FP16 and GPTQ decreases.
As reported in the GPTQ paper, I confirmed that the model works well even at a surprisingly low 3 bits.
An n-bit CUDA kernel is required to increase speed and reduce memory usage.
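As a rough illustration of where the memory saving comes from (a sketch only, not this repo's CUDA kernel): eight 4-bit codes can be packed into one 32-bit word, so the quantized weights occupy about a quarter of their FP16 size:

```python
import torch

def pack_int4(codes: torch.Tensor) -> torch.Tensor:
    """Pack 4-bit codes (values 0..15) into int32 words, 8 codes per word."""
    codes = codes.to(torch.int32).reshape(-1, 8)
    packed = torch.zeros(codes.shape[0], dtype=torch.int32)
    for i in range(8):
        packed |= codes[:, i] << (4 * i)
    return packed

def unpack_int4(packed: torch.Tensor) -> torch.Tensor:
    """Recover the original 4-bit codes from the packed int32 words."""
    return torch.stack([(packed >> (4 * i)) & 0xF for i in range(8)], dim=1).reshape(-1)

codes = torch.randint(0, 16, (4096 * 4096,))
packed = pack_int4(codes)
assert torch.equal(unpack_int4(packed), codes.to(torch.int32))
# Packed int4 storage is ~1/4 the size of the same weights stored in FP16.
print(packed.numel() * 4, "bytes packed vs", codes.numel() * 2, "bytes in FP16")
```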
* `torch`: tested on v1.12.1+cu113
* `transformers`: tested on v4.27.0.dev0 (required)
* `datasets`: tested on v2.10.1
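A quick, optional way to confirm the installed versions match the ones above:

```python
import torch, transformers, datasets
print(torch.__version__, transformers.__version__, datasets.__version__)
```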
All experiments were run on a single NVIDIA RTX3090.
```
# Compute full precision (FP16) results
CUDA_VISIBLE_DEVICES=0 python llama.py decapoda-research/llama-7b-hf c4
# Run RTN baseline and compute results
CUDA_VISIBLE_DEVICES=0 python llama.py decapoda-research/llama-7b-hf c4 --wbits 4 --nearest
# Run GPTQ and compute results
CUDA_VISIBLE_DEVICES=0 python llama.py decapoda-research/llama-7b-hf c4 --wbits 4 --groupsize 64
```
To run other LLaMa models, replace `llama-7b-hf` with one of: `llama-13b-hf`, `llama-30b-hf`, `llama-65b-hf`.
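The commands above report perplexity on the evaluation sets. For reference, here is a minimal stand-alone sketch of how a causal-LM perplexity number can be computed with Hugging Face transformers (the checkpoint name and the 2048-token window are illustrative assumptions, not taken from llama.py):

```python
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "decapoda-research/llama-7b-hf"  # example checkpoint; any causal LM works
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=False)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16).cuda().eval()

# Concatenate the test split and score it in fixed-length windows.
text = "\n\n".join(load_dataset("wikitext", "wikitext-2-raw-v1", split="test")["text"])
ids = tokenizer(text, return_tensors="pt").input_ids.cuda()

seqlen, nlls = 2048, []
with torch.no_grad():
    for i in range(0, ids.shape[1] - seqlen, seqlen):
        window = ids[:, i : i + seqlen]
        loss = model(window, labels=window).loss  # mean negative log-likelihood per token
        nlls.append(loss.float() * seqlen)
print("ppl:", torch.exp(torch.stack(nlls).sum() / (len(nlls) * seqlen)).item())
```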
For zero-shot evaluation, see the `zeroShot/` folder.
This is an experimental feature; I haven't verified yet that it works.
```
# Install kernels
python setup_cuda.py install

# Benchmark performance for FC2 layer of OPT-175B
CUDA_VISIBLE_DEVICES=0 python test_kernel.py
```
```
# Benchmark language generation with 3-bit LLaMa-7B:

# Save compressed model
CUDA_VISIBLE_DEVICES=0 python llama.py decapoda-research/llama-7b-hf c4 --wbits 3 --save llama7b-3bit.pt
# Benchmark generating a 128 token sequence with the saved model
CUDA_VISIBLE_DEVICES=0 python llama.py decapoda-research/llama-7b-hf c4 --load llama7b-3bit.pt --benchmark 128
# Benchmark FP16 baseline, note that the model will be split across all listed GPUs
CUDA_VISIBLE_DEVICES=0,1,2,3,4 python llama.py decapoda-research/llama-7b-hf c4 --benchmark 128
```
Please note that GPTQ 3-bit kernels are currently only optimized for OPT-175B running on 1xA100 or 2xA6000 and may thus yield suboptimal performance on smaller models or on other GPUs.
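`--benchmark 128` essentially measures per-token generation latency and memory. A rough, hypothetical way to time generation yourself with plain transformers (not the repo's benchmark code):

```python
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "decapoda-research/llama-7b-hf"  # example; llama.py --load handles the quantized checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=False)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16).cuda().eval()

inputs = tokenizer("The quick brown fox", return_tensors="pt").to("cuda")
torch.cuda.synchronize()
start = time.time()
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=128, do_sample=False)
torch.cuda.synchronize()
elapsed = time.time() - start

new_tokens = out.shape[1] - inputs["input_ids"].shape[1]  # greedy decoding may stop early at EOS
print(f"{elapsed / new_tokens * 1000:.1f} ms/token, "
      f"peak memory {torch.cuda.max_memory_allocated() / 2**30:.1f} GiB")
```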
This code is based on GPTQ. Thanks to Meta AI for releasing LLaMa, a powerful LLM.