This repository contains the code for the ICLR 2023 paper GPTQ: Accurate Post-training Compression for Generative Pretrained Transformers. The current release includes the following features:
- An efficient implementation of the GPTQ algorithm: `gptq.py` (a simplified sketch of its core update step follows this list)
- Compressing all models from the OPT and BLOOM families to 2/3/4 bits, including weight grouping: `opt.py`, `bloom.py`, `zeroShot/`
- Evaluating the perplexity of quantized models on several language generation tasks: `opt.py`, `bloom.py`
- Evaluating the performance of quantized models on several ZeroShot tasks: `zeroShot/`
- A CUDA kernel for multiplying a 3-bit quantized matrix with a full-precision vector: `quant_cuda_kernel.cu`, `quant_cuda.cpp`, `setup_cuda.py`
- Benchmarking code for individual matrix-vector products and for language generation with quantized models: `test_kernel.py`, `opt.py`
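At its core, GPTQ quantizes a weight matrix one column at a time and uses the inverse Hessian of the layer inputs (in Cholesky form) to spread each column's rounding error over the columns that have not been quantized yet. The snippet below is only a simplified sketch of that inner loop under an illustrative per-row `scale`/`zero`/`maxq` grid; `gptq.py` additionally implements lazy batch updates, dampening, grouping and the other details described in the paper.

```python
import torch

def gptq_quantize_block(W, Hinv, scale, zero, maxq):
    """Simplified sketch of GPTQ's column-by-column quantization loop.

    W          : [rows, cols] weight block, modified in place
    Hinv       : [cols, cols] Cholesky factor of the inverse input Hessian
    scale/zero : [rows] per-row quantization grid, maxq = 2**bits - 1
    """
    Q = torch.zeros_like(W)
    for i in range(W.shape[1]):
        w, d = W[:, i], Hinv[i, i]
        q = torch.clamp(torch.round(w / scale + zero), 0, maxq)  # round to the grid
        Q[:, i] = q
        dq = scale * (q - zero)                                  # dequantized column
        err = (w - dq) / d
        # compensate the not-yet-quantized columns for the rounding error
        W[:, i:] -= err.unsqueeze(1) * Hinv[i, i:].unsqueeze(0)
    return Q

# Toy usage with random data; real scales/Hessians come from calibration inputs.
W = torch.randn(8, 16)
Hinv = torch.eye(16)                       # identity stand-in for the Cholesky factor
scale = W.abs().max(dim=1).values / 7.5    # crude asymmetric 4-bit grid
zero = torch.full_like(scale, 8.0)
Q = gptq_quantize_block(W.clone(), Hinv, scale, zero, maxq=15)
```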
Together with the camera-ready version of the paper we have added several updates to this repository:
- Slightly adjusted preprocessing of C4 and PTB for more realistic evaluations (used in our updated results); can be activated via the flag `--new-eval`.
- Optimized 3-bit kernels, which are considerably faster, especially on the A100, e.g. a 1.9x -> 3.25x generation speedup for OPT-175B; can be activated via `--faster-kernel`.
- A minimal LLaMa integration (for more complete features see the GPTQ-for-LLaMA repository), which demonstrates two new tricks: `--act-order` (quantizing columns in order of decreasing activation size; a short sketch of the reordering follows this list) and `--true-sequential` (performing sequential quantization even within a single Transformer block). These fix GPTQ's strangely bad performance on the 7B model (from 7.15 to 6.09 Wiki2 PPL) and lead to slight improvements on most models/settings in general.
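In code, the `--act-order` reordering amounts to sorting columns by an activation statistic before running the quantization loop and undoing the permutation afterwards. A small sketch, assuming the statistic is the diagonal of the accumulated input Hessian (names are illustrative, not the exact code in the repository):

```python
import torch

def act_order_permutation(hessian_diag: torch.Tensor) -> torch.Tensor:
    """Columns with larger accumulated activation statistics are quantized first.

    hessian_diag is assumed to hold the diagonal of the layer's input Hessian
    (the per-input-feature sum of squared activations); the exact statistic used
    by the repository may differ in detail.
    """
    return torch.argsort(hessian_diag, descending=True)

# Hypothetical usage: permute, quantize left-to-right, then undo the permutation.
hessian_diag = torch.rand(4096)            # stand-in for accumulated activation stats
perm = act_order_permutation(hessian_diag)
W = torch.randn(4096, 4096)
W_perm = W[:, perm]                        # quantize the columns of W_perm in order
inv_perm = torch.argsort(perm)
W_back = W_perm[:, inv_perm]               # restores the original column order
assert torch.equal(W_back, W)
```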
Here is a summary of the LLaMa results (WikiText-2 perplexity, lower is better; 3g128 denotes 3-bit quantization with group size 128):

| Wiki2 PPL | FP16 | 4bit-RTN | 4bit-GPTQ | 3bit-RTN | 3bit-GPTQ | 3g128-GPTQ |
|---|---|---|---|---|---|---|
| LLaMa-7B | 5.68 | 6.29 | 6.09 | 25.54 | 8.07 | 6.61 |
| LLaMa-13B | 5.09 | 5.53 | 5.36 | 11.40 | 6.63 | 5.62 |
| LLaMa-30B | 4.10 | 4.54 | 4.45 | 14.89 | 5.69 | 4.80 |
| LLaMa-65B | 3.53 | 3.92 | 3.84 | 10.59 | 5.04 | 4.17 |
Here is a sample command:

```
python llama.py LLAMA_HF_FOLDER c4 --wbits 4 --true-sequential --act-order --new-eval
```
The `--act-order` heuristic also dramatically improves accuracy on the OPT-66B outlier model: from 9.55 to 9.34 and from 14.16 to 9.95 Wiki2 PPL for 4-bit and 3-bit, respectively.
The code has the following dependencies:
- `torch`: tested on v1.10.1+cu111
- `transformers`: tested on v4.21.2 (the LLaMa integration currently requires an install of the `main` branch from source, plus `sentencepiece`)
- `datasets`: tested on v1.17.0
- to run the 3-bit kernels: a setup for compiling PyTorch CUDA extensions (see also https://pytorch.org/tutorials/advanced/cpp_extension.html); tested with CUDA 11.4
All experiments were run on a single 80GB NVIDIA A100. However, most experiments will work on a GPU with a lot less memory as well.
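As a quick sanity check of the environment (a convenience snippet, not part of the repository), the tested versions can be printed as follows:

```python
# Print the installed versions of the main dependencies and confirm CUDA is visible.
import torch, transformers, datasets

print("torch:", torch.__version__)                 # tested: 1.10.1+cu111
print("transformers:", transformers.__version__)   # tested: 4.21.2
print("datasets:", datasets.__version__)           # tested: 1.17.0
print("CUDA available:", torch.cuda.is_available())
```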
To quantize OPT models and evaluate them on language generation tasks:

```
# Compute full precision (FP16) results
CUDA_VISIBLE_DEVICES=0 python opt.py facebook/opt-125m c4
# Run RTN baseline and compute results
CUDA_VISIBLE_DEVICES=0 python opt.py facebook/opt-125m c4 --wbits 4 --nearest
# Run GPTQ and compute results
CUDA_VISIBLE_DEVICES=0 python opt.py facebook/opt-125m c4 --wbits 4 [--groupsize 1024]
```
To run other OPT models, replace `opt-125m` with one of: `opt-350m`, `opt-1.3b`, `opt-2.7b`, `opt-6.7b`, `opt-13b`, `opt-66b`.
For the 175B-parameter model, you have to request access from Meta and then convert it to a local HuggingFace checkpoint using their scripts in `metaseq`. Once you have such a checkpoint, simply pass its path instead of `facebook/opt-125m`.
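For context, the `--nearest` flag used above runs the round-to-nearest (RTN) baseline, i.e. mapping each row of a weight matrix to a min-max grid without any error compensation. The sketch below is only illustrative; the repository's quantizer supports further options such as symmetric grids and grouping.

```python
import torch

def rtn_quantize(W: torch.Tensor, wbits: int = 4) -> torch.Tensor:
    """Round-to-nearest baseline: per-row asymmetric min-max quantization."""
    maxq = 2 ** wbits - 1
    wmin = W.min(dim=1, keepdim=True).values.clamp(max=0)    # grid must include 0
    wmax = W.max(dim=1, keepdim=True).values.clamp(min=0)
    scale = (wmax - wmin) / maxq
    scale[scale == 0] = 1.0                                  # guard against flat rows
    zero = torch.round(-wmin / scale)
    q = torch.clamp(torch.round(W / scale) + zero, 0, maxq)  # integer codes
    return scale * (q - zero)                                # dequantized weights

W = torch.randn(768, 3072)
print((W - rtn_quantize(W, wbits=4)).abs().mean())           # mean quantization error
```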
The corresponding commands for BLOOM models are:

```
# Compute full precision (FP16) results
CUDA_VISIBLE_DEVICES=0 python bloom.py bigscience/bloom-560m c4
# Run RTN baseline and compute results
CUDA_VISIBLE_DEVICES=0 python bloom.py bigscience/bloom-560m c4 --wbits 4 --nearest
# Run GPTQ and compute results
CUDA_VISIBLE_DEVICES=0 python bloom.py bigscience/bloom-560m c4 --wbits 4 [--groupsize 1024]
```
To run other BLOOM models, replace `bloom-560m` with one of: `bloom-1b1`, `bloom-1b7`, `bloom-3b`, `bloom-7b1`, `bloom`.
For the ZeroShot experiments, see the `zeroShot/` folder.
To build the 3-bit CUDA kernels and run the benchmarks:

```
# Install kernels
python setup_cuda.py install
# Benchmark performance for the FC2 layer of OPT-175B
CUDA_VISIBLE_DEVICES=0 python test_kernel.py
# Benchmark language generation with 3-bit OPT-175B:
# OPT175B denotes the name of the folder with the HuggingFace OPT-175B checkpoint (see above)
# Save the compressed model
CUDA_VISIBLE_DEVICES=0 python opt.py OPT175B c4 --wbits 3 --save opt175b-3bit.pt
# Benchmark generating a 128-token sequence with the saved model
CUDA_VISIBLE_DEVICES=0 python opt.py OPT175B c4 --load opt175b-3bit.pt --benchmark 128
# Benchmark the FP16 baseline; note that the model will be split across all listed GPUs
CUDA_VISIBLE_DEVICES=0,1,2,3,4 python opt.py OPT175B c4 --benchmark 128
```
Please note that our 3-bit kernels are currently only optimized for OPT-175B running on 1xA100 or 2xA6000 and may thus yield suboptimal performance on smaller models or on other GPUs.
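For reference, such kernels keep the 3-bit weight codes bit-packed in 32-bit words (32 codes fit exactly into three words) and unpack them on the fly inside the matrix-vector product. The NumPy sketch below only illustrates this generic packing scheme; the actual memory layout in `quant_cuda_kernel.cu` may differ.

```python
import numpy as np

def pack_3bit(codes: np.ndarray) -> np.ndarray:
    """Pack 3-bit integer codes (values 0..7) into uint32 words, 32 codes per 3 words."""
    assert codes.size % 32 == 0 and codes.max() < 8
    packed = np.zeros(codes.size * 3 // 32, dtype=np.uint32)
    for i, c in enumerate(codes.astype(np.uint64)):
        bit = 3 * i                          # absolute bit offset of this code
        word, off = divmod(bit, 32)
        packed[word] |= np.uint32((int(c) << off) & 0xFFFFFFFF)
        if off > 29:                         # code straddles a word boundary
            packed[word + 1] |= np.uint32(int(c) >> (32 - off))
    return packed

def unpack_3bit(packed: np.ndarray, n: int) -> np.ndarray:
    """Inverse of pack_3bit: recover n 3-bit codes from the packed words."""
    codes = np.empty(n, dtype=np.uint8)
    for i in range(n):
        bit = 3 * i
        word, off = divmod(bit, 32)
        val = int(packed[word]) >> off
        if off > 29:                         # pull the spilled bits from the next word
            val |= int(packed[word + 1]) << (32 - off)
        codes[i] = val & 0b111
    return codes

# Round-trip check on random 3-bit codes.
codes = np.random.randint(0, 8, size=64).astype(np.uint8)
assert np.array_equal(unpack_3bit(pack_3bit(codes), 64), codes)
```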
If you found this work useful, please consider citing:

```
@article{frantar-gptq,
  title={{GPTQ}: Accurate Post-training Compression for Generative Pretrained Transformers},
  author={Elias Frantar and Saleh Ashkboos and Torsten Hoefler and Dan Alistarh},
  year={2022},
  journal={arXiv preprint arXiv:2210.17323}
}
```