Quantization of SantaCoder using GPTQ
GPTQ is a state-of-the-art one-shot weight quantization method.
This code is based on GPTQ and has been changed to support new features proposed by the GPTQ authors:
- Slightly adjusted preprocessing of C4 and PTB for more realistic evaluations (used in our updated results); can be activated via the flag --new-eval.
- Two new tricks: --act-order (quantizing columns in order of decreasing activation size) and --true-sequential (performing sequential quantization even within a single Transformer block). These fix GPTQ's strangely bad performance on the 7B model (from 7.15 to 6.09 Wiki2 PPL) and lead to slight improvements on most models/settings in general; a minimal sketch of the column-reordering idea follows the note below.
This implementation supports --act-order, but it is currently very slow.
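The sketch below illustrates only the --act-order reordering step: weight columns are quantized in order of decreasing activation statistics (here a stand-in for the diagonal of the calibration Hessian), and the permutation is undone afterwards. This is an illustrative outline under those assumptions, not the quantizer or kernel code in this repo.

import torch

# Toy illustration of --act-order: choose a quantization order for the weight
# columns by sorting the (stand-in) Hessian diagonal in decreasing order.
def act_order_permutation(hessian_diag: torch.Tensor) -> torch.Tensor:
    return torch.argsort(hessian_diag, descending=True)

W = torch.randn(8, 16)            # (out_features, in_features)
hessian_diag = torch.rand(16)     # stand-in for diag(2 * X X^T) from calibration data
perm = act_order_permutation(hessian_diag)
W_reordered = W[:, perm]          # quantize columns in this order...
inv_perm = torch.argsort(perm)    # ...then undo the permutation afterwards
assert torch.equal(W_reordered[:, inv_perm], W)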
SantaCoder | Bits | group-size | memory(MiB) | wikitext2 | ptb | c4 | stack | checkpoint size(MB) |
---|---|---|---|---|---|---|---|---|
FP32 | 32 | - | 4344.722 | 24.927 | 38.574 | 27.779 | 2.619 | 4394 |
BF16 | 16 | - | 2173.680 | 24.960 | 38.597 | 27.794 | 2.621 | 2195 |
GPTQ | 8 | -1 | 1396.548 | 24.936 | 38.592 | 27.785 | 2.619 | 1411 |
GPTQ | 4 | -1 | 911.384 | 26.581 | 40.717 | 29.232 | 2.658 | 913 |
GPTQ | 3 | -1 | - | 11761.473 | 7273.338 | 9124.941 | 2485.844 | 789 |
GPTQ | 2 | -1 | - | 67976.797 | 68994.484 | 73294.438 | 45370.488 | 649 |
StarCoder | Bits | group-size | memory(MiB) | wikitext2 | ptb | c4 | stack | checkpoint size(MB) |
---|---|---|---|---|---|---|---|---|
FP32 | 32 | - | - | 10.801 | 16.425 | 13.402 | 1.738 | 59195 |
BF16 | 16 | - | - | 10.807 | 16.424 | 13.408 | 1.739 | 29597 |
GPTQ | 8 | 128 | - | 10.805 | 15.453 | 13.408 | 1.739 | 16163 |
GPTQ | 4 | 128 | - | 10.989 | 16.839 | 13.676 | 1.757 | 8877 |
StarCoderBase | Bits | group-size | memory(MiB) | wikitext2 | ptb | c4 | stack | checkpoint size(MB) |
---|---|---|---|---|---|---|---|---|
FP32 | 32 | - | - | 10.172 | 15.756 | 12.736 | 1.692 | 59195 |
BF16 | 16 | - | - | 10.173 | 15.765 | 12.745 | 1.692 | 29597 |
GPTQ | 8 | 128 | - | 10.174 | 15.767 | 12.739 | 1.692 | 16163 |
GPTQ | 4 | 128 | - | 10.387 | 16.056 | 13.005 | 1.708 | 8877 |
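For reference, the sketch below shows how fixed-length perplexity numbers like those above are typically computed for a causal LM, using Hugging Face transformers and datasets on the BF16 baseline. The dataset choice, the sequence length of 2048, and the preprocessing here are assumptions for illustration and may not match the exact evaluation used to produce the tables (in particular the --new-eval preprocessing mentioned above).

import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "bigcode/gpt_bigcode-santacoder"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16).cuda().eval()

# Concatenate the wikitext2 test split and score it in fixed-length chunks.
text = "\n\n".join(load_dataset("wikitext", "wikitext-2-raw-v1", split="test")["text"])
ids = tokenizer(text, return_tensors="pt").input_ids

seqlen = 2048
nlls = []
with torch.no_grad():
    for i in range(ids.shape[1] // seqlen):
        batch = ids[:, i * seqlen : (i + 1) * seqlen].cuda()
        loss = model(batch, labels=batch).loss  # mean NLL per predicted token
        nlls.append(loss.float() * seqlen)
ppl = torch.exp(torch.stack(nlls).sum() / (len(nlls) * seqlen))
print(f"wikitext2 perplexity: {ppl.item():.3f}")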
Quantization requires a large amount of CPU memory. However, the memory required can be reduced by using swap memory.
Depending on the GPUs/drivers, there may be a difference in performance; this gap decreases as the model size increases (see IST-DASLab/gptq#1).
According to the GPTQ paper, the performance gap between FP16 and GPTQ also decreases as the model size increases.
If you don't have conda, install it first.
conda create --name gptq python=3.9 -y
conda activate gptq
conda install pytorch torchvision torchaudio pytorch-cuda=11.7 -c pytorch -c nvidia
# Or, if you're having trouble with conda, use pip with python3.9:
# pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu117
pip install -r requirements.txt
python setup_cuda.py install
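Optionally, verify that the CUDA extension built and is importable. The module name quant_cuda below is an assumption carried over from upstream GPTQ implementations; adjust it if setup_cuda.py in this repo registers a different name.

import torch
import quant_cuda  # assumed name of the extension built by setup_cuda.py

print("CUDA available:", torch.cuda.is_available())
print("Device:", torch.cuda.get_device_name(0))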
All experiments were run on a single NVIDIA RTX3090.
Visit mayank31398/santacoder-GPTQ-4bit-128g for the 4-bit weights. Visit mayank31398/santacoder-GPTQ-8bit-128g for the 8-bit weights.
# 4-bit
git clone https://huggingface.co/mayank31398/santacoder-GPTQ-4bit-128g
# 8-bit
git clone https://huggingface.co/mayank31398/santacoder-GPTQ-8bit-128g
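If you prefer not to use git/git-lfs, the same checkpoints can be fetched with huggingface_hub; this is a sketch and requires a recent version of the library (pip install huggingface_hub).

from huggingface_hub import snapshot_download

# Download the quantized checkpoints into local folders matching the paths used below.
snapshot_download("mayank31398/santacoder-GPTQ-4bit-128g", local_dir="santacoder-GPTQ-4bit-128g")
snapshot_download("mayank31398/santacoder-GPTQ-8bit-128g", local_dir="santacoder-GPTQ-8bit-128g")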
Alternatively, you can use the scripts in this repo to quantize the model yourself and save the checkpoint to disk.
For generation use:
# fp32
python -m santacoder_inference bigcode/gpt_bigcode-santacoder --wbits 32
# bf16
python -m santacoder_inference bigcode/gpt_bigcode-santacoder --wbits 16
# GPTQ int8
python -m santacoder_inference bigcode/gpt_bigcode-santacoder --wbits 8 --load santacoder-GPTQ-8bit-128g/model.pt
# GPTQ int4
python -m santacoder_inference bigcode/gpt_bigcode-santacoder --wbits 4 --load santacoder-GPTQ-4bit-128g/model.pt
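For reference, the FP32/BF16 rows correspond to the plain Hugging Face model, which can also be loaded directly with transformers as sketched below; the quantized checkpoints still require this repo's santacoder_inference loader. The same pattern applies to the StarCoder and StarCoderBase checkpoints in the following sections. The prompt and generation settings are illustrative only.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "bigcode/gpt_bigcode-santacoder"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16).cuda()

prompt = "def fibonacci(n):"
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
out = model.generate(**inputs, max_new_tokens=64, do_sample=False)
print(tokenizer.decode(out[0]))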
Visit mayank31398/starcoder-GPTQ-4bit-128g for the 4-bit weights. Visit mayank31398/starcoder-GPTQ-8bit-128g for the 8-bit weights.
# 4-bit
git clone https://huggingface.co/mayank31398/starcoder-GPTQ-4bit-128g
# 8-bit
git clone https://huggingface.co/mayank31398/starcoder-GPTQ-8bit-128g
Alternatively, you can use the scripts in this repo to quantize the model yourself and save the checkpoint to disk.
For generation use:
# fp32
python -m santacoder_inference bigcode/starcoder --wbits 32
# bf16
python -m santacoder_inference bigcode/starcoder --wbits 16
# GPTQ int8
python -m santacoder_inference bigcode/starcoder --wbits 8 --groupsize 128 --load starcoder-GPTQ-8bit-128g/model.pt
# GPTQ int4
python -m santacoder_inference bigcode/starcoder --wbits 4 --groupsize 128 --load starcoder-GPTQ-4bit-128g/model.pt
Visit mayank31398/starcoderbase-GPTQ-4bit-128g for the 4-bit weights. Visit mayank31398/starcoderbase-GPTQ-8bit-128g for the 8-bit weights.
# 4-bit
git clone https://huggingface.co/mayank31398/starcoderbase-GPTQ-4bit-128g
# 8-bit
git clone https://huggingface.co/mayank31398/starcoderbase-GPTQ-8bit-128g
Alternatively, you can use the scripts in this repo to quantize the model yourself and save the checkpoint to disk.
For generation use:
# fp32
python -m santacoder_inference bigcode/starcoderbase --wbits 32
# bf16
python -m santacoder_inference bigcode/starcoderbase --wbits 16
# GPTQ int8
python -m santacoder_inference bigcode/starcoderbase --wbits 8 --groupsize 128 --load starcoderbase-GPTQ-8bit-128g/model.pt
# GPTQ int4
python -m santacoder_inference bigcode/starcoderbase --wbits 4 --groupsize 128 --load starcoderbase-GPTQ-4bit-128g/model.pt
This code is based on GPTQ.
The Triton GPTQ kernel code is based on GPTQ-triton.