AffineQuant: Affine Transformation Quantization for Large Language Models
AffineQuant is a simple and powerful quantization technique for LLMs.
conda create -n affinequant python=3.10 -y
conda activate affinequant
git clone https://github.com/bytedance/AffineQuant.git
cd AffineQuant
pip install --upgrade pip
pip install -e .
We also leverage the AutoGPTQ kernel for real quantization, so you should also install AutoGPTQ as follows:
pip install auto_gptq
Coming Soon.
We provide full scripts to run AffineQuant in ./scripts/. We use LLaMa-7B as an example here:
- Obtain the channel-wise scales and shifts required for initialization:
Optionally, you can also generate the channel-wise scales and shifts yourself with the provided script:
python generate_act_scale_shift.py --model /PATH/TO/LLaMA/llama-7b
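For intuition, this step gathers per-channel activation statistics over a small calibration set. The following is a minimal sketch of that idea only (forward hooks recording per-channel max magnitudes and means); the function and variable names are our own and are not those used in generate_act_scale_shift.py.

import torch

@torch.no_grad()
def collect_act_scales_shifts(model, tokenizer, calib_texts, device="cuda"):
    # For every Linear layer input, track per-channel statistics:
    # "scale" = running max of |x| per channel, "shift" = running mean per channel.
    stats, hooks = {}, []

    def make_hook(name):
        def hook(module, inputs, output):
            x = inputs[0].detach().flatten(0, -2)              # [tokens, channels]
            cur_max, cur_mean = x.abs().amax(dim=0), x.mean(dim=0)
            if name not in stats:
                stats[name] = {"scale": cur_max, "shift": cur_mean}
            else:
                stats[name]["scale"] = torch.maximum(stats[name]["scale"], cur_max)
                stats[name]["shift"] = 0.5 * (stats[name]["shift"] + cur_mean)
        return hook

    for name, module in model.named_modules():
        if isinstance(module, torch.nn.Linear):
            hooks.append(module.register_forward_hook(make_hook(name)))

    for text in calib_texts:
        ids = tokenizer(text, return_tensors="pt").input_ids.to(device)
        model(ids)

    for h in hooks:
        h.remove()
    return stats

# Example usage (hypothetical):
# from transformers import AutoModelForCausalLM, AutoTokenizer
# model = AutoModelForCausalLM.from_pretrained("/PATH/TO/LLaMA/llama-7b", torch_dtype=torch.float16).to("cuda")
# tokenizer = AutoTokenizer.from_pretrained("/PATH/TO/LLaMA/llama-7b")
# stats = collect_act_scales_shifts(model, tokenizer, ["calibration text ..."])
# torch.save(stats, "act_scales_shifts.pt")

The resulting statistics can then be saved and used to initialize the scale and shift parameters before optimization.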
- Weight-only quantization
# W3A16
CUDA_VISIBLE_DEVICES=0 python main.py \
--model /PATH/TO/LLaMA/llama-7b \
--epochs 20 --output_dir ./log/llama-7b-w3a16 \
--eval_ppl --wbits 3 --abits 16 --lwc --let --use_ln_matrix --sf 1e-2
# W3A16g128
CUDA_VISIBLE_DEVICES=0 python main.py \
--model /PATH/TO/LLaMA/llama-7b \
--epochs 20 --output_dir ./log/llama-7b-w3a16g128 \
--eval_ppl --wbits 3 --abits 16 --group_size 128 --lwc --let --use_ln_matrix --sf 1e-2
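Conceptually, AffineQuant learns an invertible affine matrix that is merged into the weights before uniform quantization, while its inverse is folded into the preceding activations so the full-precision output is preserved. The toy sketch below illustrates that idea under our own simplifying assumptions (random weights, a near-identity affine matrix, plain asymmetric fake quantization); it is not the repository's implementation.

import torch

def fake_quantize_weight(w, n_bits=3, group_size=None):
    # Uniform asymmetric fake quantization; per row, or per group along the input dimension.
    orig_shape = w.shape
    if group_size is not None:
        w = w.reshape(-1, group_size)
    qmax = 2 ** n_bits - 1
    w_min, w_max = w.amin(dim=-1, keepdim=True), w.amax(dim=-1, keepdim=True)
    scale = (w_max - w_min).clamp(min=1e-5) / qmax
    zero = (-w_min / scale).round()
    q = (w / scale + zero).round().clamp(0, qmax)
    return ((q - zero) * scale).reshape(orig_shape)

out_features, in_features = 512, 512
W = torch.randn(out_features, in_features)                                   # torch Linear layout: [out, in]
A = torch.eye(in_features) + 0.01 * torch.randn(in_features, in_features)    # toy near-identity affine matrix

W_q = fake_quantize_weight(W @ A, n_bits=3, group_size=128)   # quantize the transformed weight
x = torch.randn(4, in_features)
y_ref = x @ W.T                                               # full-precision output
y_q = (x @ torch.linalg.inv(A).T) @ W_q.T                     # inverse transform folded into the activation path
print((y_ref - y_q).abs().mean())                             # remaining error is the quantization error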
- Weight-activation quantization
# W4A4
CUDA_VISIBLE_DEVICES=0 python main.py \
--model /PATH/TO/LLaMA/llama-7b \
--epochs 20 --output_dir ./log/llama-7b-w4a4 \
--eval_ppl --wbits 4 --abits 4 --lwc --let --aug_loss --use_matrix --sf 0.1 \
--tasks hendrycksTest,piqa,arc_easy,arc_challenge,boolq,hellaswag,winogrande
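In the weight-activation setting, activations are fake-quantized as well, typically with dynamic per-token ranges (an assumption on our part; the repository's exact quantizer configuration may differ). A minimal sketch:

import torch

def fake_quantize_activation(x, n_bits=4):
    # Dynamic asymmetric fake quantization with one scale/zero-point per token.
    qmax = 2 ** n_bits - 1
    x_min, x_max = x.amin(dim=-1, keepdim=True), x.amax(dim=-1, keepdim=True)
    scale = (x_max - x_min).clamp(min=1e-5) / qmax
    zero = (-x_min / scale).round()
    q = (x / scale + zero).round().clamp(0, qmax)
    return (q - zero) * scale

x = torch.randn(2, 16, 512)                  # [batch, tokens, hidden]
x_q = fake_quantize_activation(x, n_bits=4)  # 4-bit activations for the W4A4 setting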
More detailed and optional arguments:
- --model: the local model path or huggingface format.
- --wbits: weight quantization bits.
- --abits: activation quantization bits.
- --group_size: group size of weight quantization. If not set, per-channel weight quantization is used by default.
- --epochs: training epochs. You can set it to 0 to evaluate pre-trained AffineQuant checkpoints.
- --nsamples: number of calibration samples, 128 by default.
- --eval_ppl: evaluate the perplexity of quantized models.
- --tasks: evaluate zero-shot tasks.
- --resume: load pre-trained AffineQuant parameters.
- --multigpu: run inference of larger models on multiple GPUs.
- --real_quant: real quantization, which reduces memory usage.
- --save_dir: save the quantized model for further exploration.
- --use_matrix: whether to use the QK^T affine matrix.
- --use_ln_matrix: whether to use the LayerNorm affine matrix.
- --sf: stability factor for the gradual mask (see the sketch below).
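For intuition on the last two options: the affine matrices must stay well-conditioned and invertible during optimization, and the gradual mask restricts which of their entries are active, starting from the diagonal and widening over training; the stability factor controls how strongly off-diagonal entries contribute. The sketch below only illustrates this idea with a band-shaped mask of our own design; it is not the repository's masking schedule.

import torch

def gradual_mask(dim, step, total_steps, sf=1e-2):
    # Band mask around the diagonal that widens as training progresses.
    # Diagonal entries are always fully active; in-band off-diagonal entries are damped by sf.
    bandwidth = max(1, int(dim * (step + 1) / total_steps))
    idx = torch.arange(dim)
    in_band = (idx[None, :] - idx[:, None]).abs() < bandwidth
    mask = torch.zeros(dim, dim)
    mask[in_band] = sf
    mask.fill_diagonal_(1.0)
    return mask

dim = 512
A_raw = torch.eye(dim) + torch.randn(dim, dim)          # unconstrained learnable matrix (toy)
A = A_raw * gradual_mask(dim, step=0, total_steps=20)   # early in training the matrix stays close to identity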
- AffineQuant achieves SoTA performance in weight-only quantization
- AffineQuant achieves SoTA performance in weight-activation quantization
SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models
AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration
GPTQ: Accurate Post-training Compression for Generative Pretrained Transformers
RPTQ: Reorder-Based Post-Training Quantization for Large Language Models
OmniQuant: Omnidirectionally Calibrated Quantization for Large Language Models
@inproceedings{
ma2024affinequant,
title={AffineQuant: Affine Transformation Quantization for Large Language Models},
author={Yuexiao Ma and Huixia Li and Xiawu Zheng and Feng Ling and Xuefeng Xiao and Rui Wang and Shilei Wen and Fei Chao and Rongrong Ji},
booktitle={The Twelfth International Conference on Learning Representations},
year={2024},
url={https://openreview.net/forum?id=of2rhALq8l}
}