quantizations

A collection of quantization recipes for various large models including Llama-2-70B, QWen-14B, Baichuan-2-13B, and more.

Install

First, install the requirements:

conda create -n quantization python=3.9 -y
conda activate quantization
pip install -r requirements.txt

Then install auto-gptq from my fork:

git clone https://github.com/ranchlai/AutoGPTQ.git
cd AutoGPTQ
python setup.py build
pip install -e .
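
To confirm that the package is importable after the editable install, a quick check can help (a minimal sketch; the exact version string depends on the fork):

python -c "import auto_gptq; print(auto_gptq.__version__)"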

Usage

Quantize a model with the following command:

export CUDA_VISIBLE_DEVICES=0
python ../../quantize.py \
--model_name  Llama-2-70b-chat-hf \
--data data.json \
--bits 4 \
--output_folder Llama-2-70b-chat-gptq-4bit-128g \
--max_samples 1024 \
--group_size 128 \
--block_name_to_quantize "model.layers"
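
The --data argument points to a JSON file of calibration text. The schema expected by quantize.py is not shown in this README; purely as an illustration, assuming a list of records with a "text" field, such a file could be produced like this:

import json

# Hypothetical calibration file layout: a list of {"text": ...} records.
# Check quantize.py for the schema it actually expects.
samples = [
    {"text": "The quick brown fox jumps over the lazy dog."},
    {"text": "Quantization reduces model weights to fewer bits."},
]

with open("data.json", "w", encoding="utf-8") as f:
    json.dump(samples, f, ensure_ascii=False, indent=2)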

Quantized models

Model | #Params | #Bits | Download
Llama-2-70B-chat | 70B | 4 | link
CodeLlama | 34B | 4 | link
chatglm3-6B | 6B | 4 | link
Qwen-14B-Chat | 14B | 4 | link
Baichuan2-13B-chat | 13B | 4 | link

How to use the quantized models

The quantized models can be used in the same way as the original models. For example, the following code shows how to use the quantized chatglm3-6B model.

from transformers import AutoTokenizer, AutoModel

model_name_or_path = "chatglm3-6B-gptq-4bit-32g"
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, trust_remote_code=True)
model = AutoModel.from_pretrained(model_name_or_path, trust_remote_code=True, device_map="cuda:0")
model = model.eval()
# Prompt: "What are some fun sights to see in Beijing in autumn?"
response, history = model.chat(tokenizer, "北京秋天有什么好玩的景点", history=[])
print(response)
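
The decoder-only checkpoints above (e.g. Qwen-14B-Chat or Llama-2-70B-chat) follow the usual causal-LM loading path instead of the chat helper used by chatglm3. A minimal sketch, assuming a hypothetical local directory named Qwen-14B-Chat-gptq-4bit-128g and a recent transformers with auto-gptq installed:

from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical local path to a quantized checkpoint; adjust to your download.
model_name_or_path = "Qwen-14B-Chat-gptq-4bit-128g"

tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name_or_path,
    trust_remote_code=True,
    device_map="cuda:0",
)
model = model.eval()

prompt = "What is quantization in deep learning?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))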