4-bit quantization of LLaMa using GPTQ. GPTQ is a SOTA one-shot weight quantization method.
This code is based on GPTQ.
| Model (LLaMa-7B) | Bits | group-size | Wikitext2 | PTB | C4 |
|---|---|---|---|---|---|
| FP16 | 16 | - | 5.67 | 8.79 | 7.05 |
| RTN | 4 | - | 6.28 | 9.68 | 7.70 |
| GPTQ | 4 | 1024 | 6.98 | 10.81 | 7.99 |
| GPTQ | 4 | 64 | 6.16 | 9.66 | 7.52 |
| RTN | 3 | - | 25.66 | 61.25 | 28.19 |
| GPTQ | 3 | 64 | 12.24 | 16.77 | 9.55 |
| RTN | 2 | - | 101940 | 123128 | 109331 |
| GPTQ | 2 | 64 | 75.28 | 241.18 | 60.79 |
| Model (LLaMa-13B) | Bits | group-size | Wikitext2 | PTB | C4 |
|---|---|---|---|---|---|
| FP16 | 16 | - | 5.08 | 8.06 | 6.58 |
| RTN | 4 | - | 5.52 | 8.62 | 6.96 |
| RTN | 3 | - | 11.41 | 21.21 | 13.20 |
Quantizing the model requires a large amount of CPU memory. For example, quantizing the LLaMa-13B model requires 32 GB, and LLaMa-33B requires more than 64 GB.
According to "The case for 4-bit precision" and the GPTQ paper, a smaller group size achieves lower perplexity (ppl). Therefore, a group size lower than 128 is recommended.
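To make the group-size trade-off concrete, here is a minimal, illustrative sketch (not the code used in this repo) of group-wise 4-bit round-to-nearest quantization; shrinking the group size gives each group its own scale and zero-point, which reduces quantization error:

```python
import torch

def quantize_rtn_4bit(weight: torch.Tensor, group_size: int):
    """Illustrative asymmetric 4-bit round-to-nearest quantization.

    Each group of `group_size` consecutive input weights shares one
    scale and zero-point, so a smaller group size means finer scales.
    """
    out_f, in_f = weight.shape
    w = weight.reshape(out_f, in_f // group_size, group_size)
    w_min, w_max = w.amin(dim=-1, keepdim=True), w.amax(dim=-1, keepdim=True)
    scale = (w_max - w_min).clamp(min=1e-8) / 15            # 16 levels for 4 bits
    zero = torch.round(-w_min / scale)
    q = torch.clamp(torch.round(w / scale) + zero, 0, 15)   # int4 codes in [0, 15]
    w_hat = (q - zero) * scale                               # dequantized weights
    return q.reshape(out_f, in_f), w_hat.reshape(out_f, in_f)

w = torch.randn(4096, 4096)
for gs in (1024, 128, 64):
    _, w_hat = quantize_rtn_4bit(w, gs)
    print(gs, (w - w_hat).pow(2).mean().item())  # reconstruction error shrinks with smaller groups
```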
Depending on the GPU and driver, there may be a difference in performance; this difference decreases as the model size increases (see IST-DASLab/gptq#1).
According to the GPTQ paper, as the model size increases, the performance gap between FP16 and GPTQ decreases.
As reported in the GPTQ paper, I confirmed that the model works well even at a surprisingly low 3 bits.
An n-bit CUDA kernel is required to increase speed and reduce memory usage.
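As a rough illustration of where the memory saving comes from (a sketch only, not this repo's CUDA kernel): eight 4-bit codes can be packed into one 32-bit word, so the quantized weights occupy about a quarter of their FP16 size:

```python
import torch

def pack_int4(codes: torch.Tensor) -> torch.Tensor:
    """Pack 4-bit codes (values 0..15) into int32 words, 8 codes per word."""
    codes = codes.to(torch.int32).reshape(-1, 8)
    packed = torch.zeros(codes.shape[0], dtype=torch.int32)
    for i in range(8):
        packed |= codes[:, i] << (4 * i)
    return packed

def unpack_int4(packed: torch.Tensor) -> torch.Tensor:
    """Recover the original 4-bit codes from the packed int32 words."""
    return torch.stack([(packed >> (4 * i)) & 0xF for i in range(8)], dim=1).reshape(-1)

codes = torch.randint(0, 16, (4096 * 4096,))
packed = pack_int4(codes)
assert torch.equal(unpack_int4(packed), codes.to(torch.int32))
# Packed int4 storage is ~1/4 the size of the same weights stored in FP16.
print(packed.numel() * 4, "bytes packed vs", codes.numel() * 2, "bytes in FP16")
```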
* `torch`: tested on v1.12.1+cu113
* `transformers`: tested on v4.27.0.dev0 (required)
* `datasets`: tested on v2.10.1
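A quick, optional way to confirm the installed versions match the ones above:

```python
import torch, transformers, datasets
print(torch.__version__, transformers.__version__, datasets.__version__)
```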
All experiments were run on a single NVIDIA RTX3090.
```
# Compute full precision (FP16) results
CUDA_VISIBLE_DEVICES=0 python llama.py decapoda-research/llama-7b-hf c4
# Run RTN baseline and compute results
CUDA_VISIBLE_DEVICES=0 python llama.py decapoda-research/llama-7b-hf c4 --wbits 4 --nearest
# Run GPTQ and compute results
CUDA_VISIBLE_DEVICES=0 python llama.py decapoda-research/llama-7b-hf c4 --wbits 4 --groupsize 64
```
To run other LLaMa models, replace `llama-7b-hf` with one of: `llama-13b-hf`, `llama-30b-hf`, `llama-65b-hf`.
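The commands above report perplexity on the evaluation sets. For reference, here is a minimal stand-alone sketch of how a causal-LM perplexity number can be computed with Hugging Face transformers (the checkpoint name and the 2048-token window are illustrative assumptions, not taken from llama.py):

```python
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "decapoda-research/llama-7b-hf"  # example checkpoint; any causal LM works
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=False)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16).cuda().eval()

# Concatenate the test split and score it in fixed-length windows.
text = "\n\n".join(load_dataset("wikitext", "wikitext-2-raw-v1", split="test")["text"])
ids = tokenizer(text, return_tensors="pt").input_ids.cuda()

seqlen, nlls = 2048, []
with torch.no_grad():
    for i in range(0, ids.shape[1] - seqlen, seqlen):
        window = ids[:, i : i + seqlen]
        loss = model(window, labels=window).loss  # mean negative log-likelihood per token
        nlls.append(loss.float() * seqlen)
print("ppl:", torch.exp(torch.stack(nlls).sum() / (len(nlls) * seqlen)).item())
```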
For zero-shot evaluation, see the `zeroShot/` folder.
This is an experimental feature; I haven't verified yet that it works.
```
# Install kernels
python setup_cuda.py install

# Benchmark performance for FC2 layer of OPT-175B
CUDA_VISIBLE_DEVICES=0 python test_kernel.py
```
```
# Benchmark language generation with 3-bit LLaMa-7B:

# Save compressed model
CUDA_VISIBLE_DEVICES=0 python llama.py decapoda-research/llama-7b-hf c4 --wbits 3 --save llama7b-3bit.pt
# Benchmark generating a 128 token sequence with the saved model
CUDA_VISIBLE_DEVICES=0 python llama.py decapoda-research/llama-7b-hf c4 --load llama7b-3bit.pt --benchmark 128
# Benchmark FP16 baseline, note that the model will be split across all listed GPUs
CUDA_VISIBLE_DEVICES=0,1,2,3,4 python llama.py decapoda-research/llama-7b-hf c4 --benchmark 128
```
Please note that GPTQ 3-bit kernels are currently only optimized for OPT-175B running on 1xA100 or 2xA6000 and may thus yield suboptimal performance on smaller models or on other GPUs.
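`--benchmark 128` essentially measures per-token generation latency and memory. A rough, hypothetical way to time generation yourself with plain transformers (not the repo's benchmark code):

```python
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "decapoda-research/llama-7b-hf"  # example; llama.py --load handles the quantized checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=False)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16).cuda().eval()

inputs = tokenizer("The quick brown fox", return_tensors="pt").to("cuda")
torch.cuda.synchronize()
start = time.time()
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=128, do_sample=False)
torch.cuda.synchronize()
elapsed = time.time() - start

new_tokens = out.shape[1] - inputs["input_ids"].shape[1]  # greedy decoding may stop early at EOS
print(f"{elapsed / new_tokens * 1000:.1f} ms/token, "
      f"peak memory {torch.cuda.max_memory_allocated() / 2**30:.1f} GiB")
```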
This code is based on GPTQ. Thanks to Meta AI for releasing LLaMa, a powerful LLM.