
AutoAWQ

| Roadmap | Examples | Issues: Help Wanted |

AutoAWQ is an easy-to-use package for 4-bit quantized models. It speeds up models by 2x while reducing memory requirements by 3x compared to FP16. AutoAWQ implements the Activation-aware Weight Quantization (AWQ) algorithm for quantizing LLMs, building on and extending the original work from MIT.

Latest News 🔥

  • [2023/09] Multi-GPU support, bug fixes, and better benchmark scripts available
  • [2023/08] PyPI package released and AutoModel class available

Install

Requirements:

  • Compute Capability 8.0 (sm80) or higher. Ampere and later architectures are supported.
  • CUDA Toolkit 11.8 or later.

Install:

  • Install from PyPI with pip:
pip install autoawq

Using conda

CUDA dependencies can sometimes be hard to manage. It is recommended to use conda with AutoAWQ:

conda create --name autoawq python=3.10 -y
conda activate autoawq
conda install pytorch=2.0.1 torchvision torchaudio cudatoolkit=11.8 -c pytorch -c nvidia
pip install autoawq
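
Optionally, you can sanity-check the environment from Python. This is only an illustrative snippet, not part of the install itself:

# Optional sanity check: the package should import and a CUDA device should be visible.
import torch
import awq

print("CUDA available:", torch.cuda.is_available())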

Build from source

Build AutoAWQ from source:
git clone https://github.com/casper-hansen/AutoAWQ
cd AutoAWQ
pip install -e .

Supported models

The detailed support list:

| Models  | Sizes                       |
| ------- | --------------------------- |
| LLaMA-2 | 7B/13B/70B                  |
| LLaMA   | 7B/13B/30B/65B              |
| Vicuna  | 7B/13B                      |
| MPT     | 7B/30B                      |
| Falcon  | 7B/40B                      |
| OPT     | 125m/1.3B/2.7B/6.7B/13B/30B |
| Bloom   | 560m/3B/7B                  |
| GPTJ    | 6.7B                        |

Usage

Below, you will find examples of how to easily quantize a model and run inference.

Quantization

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = 'lmsys/vicuna-7b-v1.5'
quant_path = 'vicuna-7b-v1.5-awq'
quant_config = { "zero_point": True, "q_group_size": 128, "w_bit": 4 }  # 4-bit weights, zero-point quantization, group size 128

# Load model
model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# Quantize
model.quantize(tokenizer, quant_config=quant_config)

# Save quantized model
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
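
The quant_config above requests 4-bit weights (w_bit), asymmetric quantization with a zero point (zero_point), and one scale/zero-point pair per group of 128 weights (q_group_size). The sketch below only illustrates what such group-wise quantization and the activation-aware scaling idea behind AWQ look like in plain PyTorch; the function names are made up, and this is not AutoAWQ's actual implementation, which uses fused CUDA kernels.

# Illustration only: group-wise 4-bit zero-point quantization plus the
# activation-aware scaling idea behind AWQ. Not AutoAWQ's real code path.
import torch

def quantize_per_group(w, n_bit=4, group_size=128):
    """Fake-quantize w with one scale and zero point per group of weights."""
    shape = w.shape
    w = w.reshape(-1, group_size)
    w_min = w.amin(dim=1, keepdim=True)
    w_max = w.amax(dim=1, keepdim=True)
    scale = (w_max - w_min).clamp(min=1e-5) / (2**n_bit - 1)   # step size per group
    zero = (-w_min / scale).round()                            # zero point per group
    q = torch.clamp((w / scale).round() + zero, 0, 2**n_bit - 1)
    return ((q - zero) * scale).reshape(shape)                 # dequantized weights

def awq_style_scale_and_quantize(weight, calib_acts, alpha=0.5):
    """Scale up input channels that see large activations, then quantize.

    Since (W * s) @ (x / s) == W @ x, the inverse scale can be folded into the
    preceding operation at inference time. AWQ searches for a good scaling
    exponent; a fixed alpha is used here for simplicity.
    """
    s = calib_acts.abs().mean(dim=0).pow(alpha).clamp(min=1e-4)
    return quantize_per_group(weight * s), s

weight = torch.randn(4096, 4096)      # (out_features, in_features)
calib_acts = torch.randn(64, 4096)    # activations from a calibration batch
w_q, s = awq_style_scale_and_quantize(weight, calib_acts)

x = torch.randn(4096)
print((weight @ x - w_q @ (x / s)).abs().mean())   # average error of the quantized matvec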

Inference

Run inference on a quantized model from the Hugging Face Hub:

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

quant_path = "casperhansen/vicuna-7b-v1.5-awq"
quant_file = "awq_model_w4_g128.pt"

model = AutoAWQForCausalLM.from_quantized(quant_path, quant_file)
tokenizer = AutoTokenizer.from_pretrained(quant_path, trust_remote_code=True)

model.generate(...)
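
For example, continuing with the model and tokenizer loaded above, a full generation call might look like the following. This is a sketch that assumes the wrapper forwards generate() to the underlying transformers model; TextStreamer simply prints tokens as they are produced:

from transformers import TextStreamer

# Print generated tokens to stdout as they are produced.
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

prompt = "What is the capital of France?"
tokens = tokenizer(prompt, return_tensors="pt").input_ids.cuda()

model.generate(tokens, streamer=streamer, max_new_tokens=64)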

Benchmarks

Benchmark speeds may vary from server to server and also depend on your CPU. If you want to minimize latency, rent a GPU/CPU combination with high memory bandwidth on both and high single-core CPU speed.
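
To measure throughput on your own hardware, a rough timing loop like the one below can be used. This is an illustrative sketch (reusing the model from the inference example and assuming it is loaded onto the GPU), not the script used to produce the table:

# Rough throughput measurement: time greedy generation and report tokens/s.
import time
import torch
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

quant_path = "casperhansen/vicuna-7b-v1.5-awq"
quant_file = "awq_model_w4_g128.pt"

model = AutoAWQForCausalLM.from_quantized(quant_path, quant_file)
tokenizer = AutoTokenizer.from_pretrained(quant_path, trust_remote_code=True)

input_ids = tokenizer("The quick brown fox", return_tensors="pt").input_ids.cuda()

model.generate(input_ids, max_new_tokens=8)   # warm-up pass for more stable numbers

torch.cuda.synchronize()
start = time.time()
out = model.generate(input_ids, max_new_tokens=256, do_sample=False)
torch.cuda.synchronize()
elapsed = time.time() - start

n_generated = out.shape[1] - input_ids.shape[1]
print(f"{n_generated / elapsed:.1f} tokens/s ({1000 * elapsed / n_generated:.2f} ms/token)")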

| Model       | GPU   | FP16 latency (ms) | INT4 latency (ms) | Speedup |
| ----------- | ----- | ----------------- | ----------------- | ------- |
| LLaMA-2-7B  | 4090  | 19.97             | 8.66              | 2.31x   |
| LLaMA-2-13B | 4090  | OOM               | 13.54             | --      |
| Vicuna-7B   | 4090  | 19.09             | 8.61              | 2.22x   |
| Vicuna-13B  | 4090  | OOM               | 12.17             | --      |
| MPT-7B      | 4090  | 17.09             | 12.58             | 1.36x   |
| MPT-30B     | 4090  | OOM               | 23.54             | --      |
| Falcon-7B   | 4090  | 29.91             | 19.84             | 1.51x   |
| LLaMA-2-7B  | A6000 | 27.14             | 12.44             | 2.18x   |
| LLaMA-2-13B | A6000 | 47.28             | 20.28             | 2.33x   |
| Vicuna-7B   | A6000 | 26.06             | 12.43             | 2.10x   |
| Vicuna-13B  | A6000 | 44.91             | 17.30             | 2.60x   |
| MPT-7B      | A6000 | 22.79             | 16.87             | 1.35x   |
| MPT-30B     | A6000 | OOM               | 31.57             | --      |
| Falcon-7B   | A6000 | 39.44             | 27.34             | 1.44x   |

Detailed benchmark (CPU vs. GPU)

Here is the difference between a fast and slow CPU on MPT-7B:

RTX 4090 + Intel i9 13900K (2 different VMs):

  • CUDA 12.0, Driver 525.125.06: 134 tokens/s (7.46 ms/token)
  • CUDA 12.0, Driver 525.125.06: 117 tokens/s (8.52 ms/token)

RTX 4090 + AMD EPYC 7-Series (3 different VMs):

  • CUDA 12.2, Driver 535.54.03: 53 tokens/s (18.6 ms/token)
  • CUDA 12.2, Driver 535.54.03: 56 tokens/s (17.71 ms/token)
  • CUDA 12.0, Driver 525.125.06: 55 tokens/s (18.15 ms/token)

Reference

If you find AWQ useful or relevant to your research, you can cite their paper:

@article{lin2023awq,
  title={AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration},
  author={Lin, Ji and Tang, Jiaming and Tang, Haotian and Yang, Shang and Dang, Xingyu and Han, Song},
  journal={arXiv},
  year={2023}
}