This repository contains the code for the ICLR 2023 paper GPTQ: Accurate Post-training Compression for Generative Pretrained Transformers. The current release includes the following features:
- An efficient implementation of the GPTQ algorithm: `gptq.py` (a simplified sketch of its core update step follows this list)
- Compressing all models from the OPT and BLOOM families to 2/3/4 bits, including weight grouping: `opt.py`, `bloom.py`, `zeroShot/`
- Evaluating the perplexity of quantized models on several language generation tasks: `opt.py`, `bloom.py`
- Evaluating the performance of quantized models on several ZeroShot tasks: `zeroShot/`
- A CUDA kernel for multiplying a 3-bit quantized matrix with a full-precision vector: `quant_cuda_kernel.cu`, `quant_cuda.cpp`, `setup_cuda.py`
- Benchmarking code for individual matrix-vector products and for language generation with quantized models: `test_kernel.py`, `opt.py`
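At its core, GPTQ quantizes a weight matrix one column at a time and uses the inverse Hessian of the layer inputs (in Cholesky form) to spread each column's rounding error over the columns that have not been quantized yet. The snippet below is only a simplified sketch of that inner loop under an illustrative per-row `scale`/`zero`/`maxq` grid; `gptq.py` additionally implements lazy batch updates, dampening, grouping and the other details described in the paper.

```python
import torch

def gptq_quantize_block(W, Hinv, scale, zero, maxq):
    """Simplified sketch of GPTQ's column-by-column quantization loop.

    W          : [rows, cols] weight block, modified in place
    Hinv       : [cols, cols] Cholesky factor of the inverse input Hessian
    scale/zero : [rows] per-row quantization grid, maxq = 2**bits - 1
    """
    Q = torch.zeros_like(W)
    for i in range(W.shape[1]):
        w, d = W[:, i], Hinv[i, i]
        q = torch.clamp(torch.round(w / scale + zero), 0, maxq)  # round to the grid
        Q[:, i] = q
        dq = scale * (q - zero)                                  # dequantized column
        err = (w - dq) / d
        # compensate the not-yet-quantized columns for the rounding error
        W[:, i:] -= err.unsqueeze(1) * Hinv[i, i:].unsqueeze(0)
    return Q

# Toy usage with random data; real scales/Hessians come from calibration inputs.
W = torch.randn(8, 16)
Hinv = torch.eye(16)                       # identity stand-in for the Cholesky factor
scale = W.abs().max(dim=1).values / 7.5    # crude asymmetric 4-bit grid
zero = torch.full_like(scale, 8.0)
Q = gptq_quantize_block(W.clone(), Hinv, scale, zero, maxq=15)
```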
Together with the camera-ready version of the paper we have added several updates to this repository:
- Slightly adjusted preprocessing of C4 and PTB for more realistic evaluations (used in our updated results); can be activated via the flag `--new-eval`.
- Optimized 3-bit kernels, which are considerably faster, especially on the A100, e.g. a 1.9x -> 3.25x generation speedup for OPT-175B; can be activated via `--faster-kernel`.
- A minimal LLaMa integration (for more complete features see the GPTQ-for-LLaMA repository), which demonstrates two new tricks: `--act-order` (quantizing columns in order of decreasing activation size; a short sketch of the reordering follows this list) and `--true-sequential` (performing sequential quantization even within a single Transformer block). These fix GPTQ's strangely bad performance on the 7B model (from 7.15 to 6.09 Wiki2 PPL) and lead to slight improvements on most models/settings in general.
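In code, the `--act-order` reordering amounts to sorting columns by an activation statistic before running the quantization loop and undoing the permutation afterwards. A small sketch, assuming the statistic is the diagonal of the accumulated input Hessian (names are illustrative, not the exact code in the repository):

```python
import torch

def act_order_permutation(hessian_diag: torch.Tensor) -> torch.Tensor:
    """Columns with larger accumulated activation statistics are quantized first.

    hessian_diag is assumed to hold the diagonal of the layer's input Hessian
    (the per-input-feature sum of squared activations); the exact statistic used
    by the repository may differ in detail.
    """
    return torch.argsort(hessian_diag, descending=True)

# Hypothetical usage: permute, quantize left-to-right, then undo the permutation.
hessian_diag = torch.rand(4096)            # stand-in for accumulated activation stats
perm = act_order_permutation(hessian_diag)
W = torch.randn(4096, 4096)
W_perm = W[:, perm]                        # quantize the columns of W_perm in order
inv_perm = torch.argsort(perm)
W_back = W_perm[:, inv_perm]               # restores the original column order
assert torch.equal(W_back, W)
```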
Here is a summary of the LLaMa results (WikiText-2 perplexity, lower is better; 3g128 denotes 3-bit quantization with group size 128):

| Wiki2 PPL | FP16 | 4bit-RTN | 4bit-GPTQ | 3bit-RTN | 3bit-GPTQ | 3g128-GPTQ |
|---|---|---|---|---|---|---|
| LLaMa-7B | 5.68 | 6.29 | 6.09 | 25.54 | 8.07 | 6.61 |
| LLaMa-13B | 5.09 | 5.53 | 5.36 | 11.40 | 6.63 | 5.62 |
| LLaMa-30B | 4.10 | 4.54 | 4.45 | 14.89 | 5.69 | 4.80 |
| LLaMa-65B | 3.53 | 3.92 | 3.84 | 10.59 | 5.04 | 4.17 |
Here is a sample command:

```
python llama.py LLAMA_HF_FOLDER c4 --wbits 4 --true-sequential --act-order --new-eval
```
The `--act-order` heuristic also dramatically improves accuracy on the OPT-66B outlier model: from 9.55 to 9.34 and from 14.16 to 9.95 Wiki2 PPL for 4-bit and 3-bit, respectively.
The code has the following dependencies:
- `torch`: tested on v1.10.1+cu111
- `transformers`: tested on v4.21.2 (the LLaMa integration currently requires an install of the `main` branch from source, plus `sentencepiece`)
- `datasets`: tested on v1.17.0
- to run the 3-bit kernels: a setup for compiling PyTorch CUDA extensions (see also https://pytorch.org/tutorials/advanced/cpp_extension.html); tested with CUDA 11.4
All experiments were run on a single 80GB NVIDIA A100. However, most experiments will work on a GPU with a lot less memory as well.
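As a quick sanity check of the environment (a convenience snippet, not part of the repository), the tested versions can be printed as follows:

```python
# Print the installed versions of the main dependencies and confirm CUDA is visible.
import torch, transformers, datasets

print("torch:", torch.__version__)                 # tested: 1.10.1+cu111
print("transformers:", transformers.__version__)   # tested: 4.21.2
print("datasets:", datasets.__version__)           # tested: 1.17.0
print("CUDA available:", torch.cuda.is_available())
```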
To quantize OPT models and evaluate them on language generation tasks:

```
# Compute full precision (FP16) results
CUDA_VISIBLE_DEVICES=0 python opt.py facebook/opt-125m c4
# Run RTN baseline and compute results
CUDA_VISIBLE_DEVICES=0 python opt.py facebook/opt-125m c4 --wbits 4 --nearest
# Run GPTQ and compute results
CUDA_VISIBLE_DEVICES=0 python opt.py facebook/opt-125m c4 --wbits 4 [--groupsize 1024]
```
To run other OPT models, replace `opt-125m` with one of: `opt-350m`, `opt-1.3b`, `opt-2.7b`, `opt-6.7b`, `opt-13b`, `opt-66b`.
For the 175B-parameter model, you have to request access from Meta and then convert it to a local HuggingFace checkpoint using their scripts in `metaseq`. Once you have such a checkpoint, simply pass its path instead of `facebook/opt-125m`.
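For context, the `--nearest` flag used above runs the round-to-nearest (RTN) baseline, i.e. mapping each row of a weight matrix to a min-max grid without any error compensation. The sketch below is only illustrative; the repository's quantizer supports further options such as symmetric grids and grouping.

```python
import torch

def rtn_quantize(W: torch.Tensor, wbits: int = 4) -> torch.Tensor:
    """Round-to-nearest baseline: per-row asymmetric min-max quantization."""
    maxq = 2 ** wbits - 1
    wmin = W.min(dim=1, keepdim=True).values.clamp(max=0)    # grid must include 0
    wmax = W.max(dim=1, keepdim=True).values.clamp(min=0)
    scale = (wmax - wmin) / maxq
    scale[scale == 0] = 1.0                                  # guard against flat rows
    zero = torch.round(-wmin / scale)
    q = torch.clamp(torch.round(W / scale) + zero, 0, maxq)  # integer codes
    return scale * (q - zero)                                # dequantized weights

W = torch.randn(768, 3072)
print((W - rtn_quantize(W, wbits=4)).abs().mean())           # mean quantization error
```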
The corresponding commands for BLOOM models are:

```
# Compute full precision (FP16) results
CUDA_VISIBLE_DEVICES=0 python bloom.py bigscience/bloom-560m c4
# Run RTN baseline and compute results
CUDA_VISIBLE_DEVICES=0 python bloom.py bigscience/bloom-560m c4 --wbits 4 --nearest
# Run GPTQ and compute results
CUDA_VISIBLE_DEVICES=0 python bloom.py bigscience/bloom-560m c4 --wbits 4 [--groupsize 1024]
```
To run other BLOOM models, replace `bloom-560m` with one of: `bloom-1b1`, `bloom-1b7`, `bloom-3b`, `bloom-7b1`, `bloom`.
For the ZeroShot experiments, see the `zeroShot/` folder.
To build the 3-bit CUDA kernels and run the benchmarks:

```
# Install kernels
python setup_cuda.py install
# Benchmark performance for the FC2 layer of OPT-175B
CUDA_VISIBLE_DEVICES=0 python test_kernel.py
# Benchmark language generation with 3-bit OPT-175B:
# OPT175B denotes the name of the folder with the HuggingFace OPT-175B checkpoint (see above)
# Save the compressed model
CUDA_VISIBLE_DEVICES=0 python opt.py OPT175B c4 --wbits 3 --save opt175b-3bit.pt
# Benchmark generating a 128-token sequence with the saved model
CUDA_VISIBLE_DEVICES=0 python opt.py OPT175B c4 --load opt175b-3bit.pt --benchmark 128
# Benchmark the FP16 baseline; note that the model will be split across all listed GPUs
CUDA_VISIBLE_DEVICES=0,1,2,3,4 python opt.py OPT175B c4 --benchmark 128
```
Please note that our 3-bit kernels are currently only optimized for OPT-175B running on 1xA100 or 2xA6000 and may thus yield suboptimal performance on smaller models or on other GPUs.
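For reference, such kernels keep the 3-bit weight codes bit-packed in 32-bit words (32 codes fit exactly into three words) and unpack them on the fly inside the matrix-vector product. The NumPy sketch below only illustrates this generic packing scheme; the actual memory layout in `quant_cuda_kernel.cu` may differ.

```python
import numpy as np

def pack_3bit(codes: np.ndarray) -> np.ndarray:
    """Pack 3-bit integer codes (values 0..7) into uint32 words, 32 codes per 3 words."""
    assert codes.size % 32 == 0 and codes.max() < 8
    packed = np.zeros(codes.size * 3 // 32, dtype=np.uint32)
    for i, c in enumerate(codes.astype(np.uint64)):
        bit = 3 * i                          # absolute bit offset of this code
        word, off = divmod(bit, 32)
        packed[word] |= np.uint32((int(c) << off) & 0xFFFFFFFF)
        if off > 29:                         # code straddles a word boundary
            packed[word + 1] |= np.uint32(int(c) >> (32 - off))
    return packed

def unpack_3bit(packed: np.ndarray, n: int) -> np.ndarray:
    """Inverse of pack_3bit: recover n 3-bit codes from the packed words."""
    codes = np.empty(n, dtype=np.uint8)
    for i in range(n):
        bit = 3 * i
        word, off = divmod(bit, 32)
        val = int(packed[word]) >> off
        if off > 29:                         # pull the spilled bits from the next word
            val |= int(packed[word + 1]) << (32 - off)
        codes[i] = val & 0b111
    return codes

# Round-trip check on random 3-bit codes.
codes = np.random.randint(0, 8, size=64).astype(np.uint8)
assert np.array_equal(unpack_3bit(pack_3bit(codes), 64), codes)
```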
If you found this work useful, please consider citing:

```
@article{frantar-gptq,
  title={{GPTQ}: Accurate Post-training Compression for Generative Pretrained Transformers},
  author={Elias Frantar and Saleh Ashkboos and Torsten Hoefler and Dan Alistarh},
  year={2022},
  journal={arXiv preprint arXiv:2210.17323}
}
```