Quantization of SantaCoder using GPTQ
GPTQ is a state-of-the-art one-shot weight quantization method.
This code is based on GPTQ and has been changed to support new features proposed by the GPTQ authors:
- Slightly adjusted preprocessing of C4 and PTB for more realistic evaluations (used in our updated results); can be activated via the flag --new-eval.
- Two new tricks: --act-order (quantizing columns in order of decreasing activation size) and --true-sequential (performing sequential quantization even within a single Transformer block). These fix GPTQ's strangely bad performance on the 7B model (from 7.15 to 6.09 Wiki2 PPL) and lead to slight improvements on most models/settings in general; a minimal sketch of the column-reordering idea follows the note below.
This implementation supports --act-order, but it is currently very slow.
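The sketch below illustrates only the --act-order reordering step: weight columns are quantized in order of decreasing activation statistics (here a stand-in for the diagonal of the calibration Hessian), and the permutation is undone afterwards. This is an illustrative outline under those assumptions, not the quantizer or kernel code in this repo.

import torch

# Toy illustration of --act-order: choose a quantization order for the weight
# columns by sorting the (stand-in) Hessian diagonal in decreasing order.
def act_order_permutation(hessian_diag: torch.Tensor) -> torch.Tensor:
    return torch.argsort(hessian_diag, descending=True)

W = torch.randn(8, 16)            # (out_features, in_features)
hessian_diag = torch.rand(16)     # stand-in for diag(2 * X X^T) from calibration data
perm = act_order_permutation(hessian_diag)
W_reordered = W[:, perm]          # quantize columns in this order...
inv_perm = torch.argsort(perm)    # ...then undo the permutation afterwards
assert torch.equal(W_reordered[:, inv_perm], W)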
SantaCoder | Bits | group-size | memory(MiB) | wikitext2 | ptb | c4 | stack | checkpoint size(MB) |
---|---|---|---|---|---|---|---|---|
FP32 | 32 | - | 4344.722 | 24.927 | 38.574 | 27.779 | 2.619 | 4394 |
BF16 | 16 | - | 2173.680 | 24.960 | 38.597 | 27.794 | 2.621 | 2195 |
GPTQ | 8 | -1 | 1396.548 | 24.936 | 38.592 | 27.785 | 2.619 | 1411 |
GPTQ | 4 | -1 | 911.384 | 26.581 | 40.717 | 29.232 | 2.658 | 913 |
GPTQ | 3 | -1 | - | 11761.473 | 7273.338 | 9124.941 | 2485.844 | 789 |
GPTQ | 2 | -1 | - | 67976.797 | 68994.484 | 73294.438 | 45370.488 | 649 |
StarCoder | Bits | group-size | memory(MiB) | wikitext2 | ptb | c4 | stack | checkpoint size(MB) |
---|---|---|---|---|---|---|---|---|
FP32 | 32 | - | - | 10.801 | 16.425 | 13.402 | 1.738 | 59195 |
BF16 | 16 | - | - | 10.807 | 16.424 | 13.408 | 1.739 | 29597 |
GPTQ | 8 | 128 | - | 10.805 | 15.453 | 13.408 | 1.739 | 16163 |
GPTQ | 4 | 128 | - | 10.989 | 16.839 | 13.676 | 1.757 | 8877 |
StarCoderBase | Bits | group-size | memory(MiB) | wikitext2 | ptb | c4 | stack | checkpoint size(MB) |
---|---|---|---|---|---|---|---|---|
FP32 | 32 | - | - | 10.172 | 15.756 | 12.736 | 1.692 | 59195 |
BF16 | 16 | - | - | 10.173 | 15.765 | 12.745 | 1.692 | 29597 |
GPTQ | 8 | 128 | - | 10.174 | 15.767 | 12.739 | 1.692 | 16163 |
GPTQ | 4 | 128 | - | 10.387 | 16.056 | 13.005 | 1.708 | 8877 |
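For reference, the sketch below shows how fixed-length perplexity numbers like those above are typically computed for a causal LM, using Hugging Face transformers and datasets on the BF16 baseline. The dataset choice, the sequence length of 2048, and the preprocessing here are assumptions for illustration and may not match the exact evaluation used to produce the tables (in particular the --new-eval preprocessing mentioned above).

import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "bigcode/gpt_bigcode-santacoder"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16).cuda().eval()

# Concatenate the wikitext2 test split and score it in fixed-length chunks.
text = "\n\n".join(load_dataset("wikitext", "wikitext-2-raw-v1", split="test")["text"])
ids = tokenizer(text, return_tensors="pt").input_ids

seqlen = 2048
nlls = []
with torch.no_grad():
    for i in range(ids.shape[1] // seqlen):
        batch = ids[:, i * seqlen : (i + 1) * seqlen].cuda()
        loss = model(batch, labels=batch).loss  # mean NLL per predicted token
        nlls.append(loss.float() * seqlen)
ppl = torch.exp(torch.stack(nlls).sum() / (len(nlls) * seqlen))
print(f"wikitext2 perplexity: {ppl.item():.3f}")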
Quantization requires a large amount of CPU memory. However, the memory required can be reduced by using swap memory.
Depending on the GPUs/drivers, there may be a difference in performance; this gap decreases as the model size increases (see IST-DASLab/gptq#1).
According to the GPTQ paper, the performance gap between FP16 and GPTQ also decreases as the model size increases.
If you don't have conda, install it first.
conda create --name gptq python=3.9 -y
conda activate gptq
conda install pytorch torchvision torchaudio pytorch-cuda=11.7 -c pytorch -c nvidia
# Or, if you're having trouble with conda, use pip with python3.9:
# pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu117
pip install -r requirements.txt
python setup_cuda.py install
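Optionally, verify that the CUDA extension built and is importable. The module name quant_cuda below is an assumption carried over from upstream GPTQ implementations; adjust it if setup_cuda.py in this repo registers a different name.

import torch
import quant_cuda  # assumed name of the extension built by setup_cuda.py

print("CUDA available:", torch.cuda.is_available())
print("Device:", torch.cuda.get_device_name(0))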
All experiments were run on a single NVIDIA RTX3090.
Visit mayank31398/santacoder-GPTQ-4bit-128g for the 4-bit weights. Visit mayank31398/santacoder-GPTQ-8bit-128g for the 8-bit weights.
# 4-bit
git clone https://huggingface.co/mayank31398/santacoder-GPTQ-4bit-128g
# 8-bit
git clone https://huggingface.co/mayank31398/santacoder-GPTQ-8bit-128g
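If you prefer not to use git/git-lfs, the same checkpoints can be fetched with huggingface_hub; this is a sketch and requires a recent version of the library (pip install huggingface_hub).

from huggingface_hub import snapshot_download

# Download the quantized checkpoints into local folders matching the paths used below.
snapshot_download("mayank31398/santacoder-GPTQ-4bit-128g", local_dir="santacoder-GPTQ-4bit-128g")
snapshot_download("mayank31398/santacoder-GPTQ-8bit-128g", local_dir="santacoder-GPTQ-8bit-128g")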
Alternatively, you can use the scripts in this repo to quantize the model yourself and save the checkpoint to disk.
For generation use:
# fp32
python -m santacoder_inference bigcode/gpt_bigcode-santacoder --wbits 32
# bf16
python -m santacoder_inference bigcode/gpt_bigcode-santacoder --wbits 16
# GPTQ int8
python -m santacoder_inference bigcode/gpt_bigcode-santacoder --wbits 8 --load santacoder-GPTQ-8bit-128g/model.pt
# GPTQ int4
python -m santacoder_inference bigcode/gpt_bigcode-santacoder --wbits 4 --load santacoder-GPTQ-4bit-128g/model.pt
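For reference, the FP32/BF16 rows correspond to the plain Hugging Face model, which can also be loaded directly with transformers as sketched below; the quantized checkpoints still require this repo's santacoder_inference loader. The same pattern applies to the StarCoder and StarCoderBase checkpoints in the following sections. The prompt and generation settings are illustrative only.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "bigcode/gpt_bigcode-santacoder"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16).cuda()

prompt = "def fibonacci(n):"
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
out = model.generate(**inputs, max_new_tokens=64, do_sample=False)
print(tokenizer.decode(out[0]))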
Visit mayank31398/starcoder-GPTQ-4bit-128g for the 4-bit weights. Visit mayank31398/starcoder-GPTQ-8bit-128g for the 8-bit weights.
# 4-bit
git clone https://huggingface.co/mayank31398/starcoder-GPTQ-4bit-128g
# 8-bit
git clone https://huggingface.co/mayank31398/starcoder-GPTQ-8bit-128g
Alternatively, you can use the scripts in this repo to quantize the model yourself and save the checkpoint to disk.
For generation use:
# fp32
python -m santacoder_inference bigcode/starcoder --wbits 32
# bf16
python -m santacoder_inference bigcode/starcoder --wbits 16
# GPTQ int8
python -m santacoder_inference bigcode/starcoder --wbits 8 --groupsize 128 --load starcoder-GPTQ-8bit-128g/model.pt
# GPTQ int4
python -m santacoder_inference bigcode/starcoder --wbits 4 --groupsize 128 --load starcoder-GPTQ-4bit-128g/model.pt
Visit mayank31398/starcoderbase-GPTQ-4bit-128g for the 4-bit weights. Visit mayank31398/starcoderbase-GPTQ-8bit-128g for the 8-bit weights.
# 4-bit
git clone https://huggingface.co/mayank31398/starcoderbase-GPTQ-4bit-128g
# 8-bit
git clone https://huggingface.co/mayank31398/starcoderbase-GPTQ-8bit-128g
Alternatively, you can use the scripts in this repo to quantize the model yourself and save the checkpoint to disk.
For generation use:
# fp32
python -m santacoder_inference bigcode/starcoderbase --wbits 32
# bf16
python -m santacoder_inference bigcode/starcoderbase --wbits 16
# GPTQ int8
python -m santacoder_inference bigcode/starcoderbase --wbits 8 --groupsize 128 --load starcoderbase-GPTQ-8bit-128g/model.pt
# GPTQ int4
python -m santacoder_inference bigcode/starcoderbase --wbits 4 --groupsize 128 --load starcoderbase-GPTQ-4bit-128g/model.pt
This code is based on GPTQ.
The Triton GPTQ kernel code is based on GPTQ-triton.