Dependencies:
- torch: tested on 1.13.1+cu117
- transformers: tested on 4.34.0
Models:
- LLaMA, LLaMA-2
- OPT
Datasets:
- Calibration: C4
- Evaluation:
  - Task accuracy: PIQA, ARC-e, ARC-c, BoolQ, COPA, StoryCloze
  - PPL: WikiText2, PTB, C4
Quantization configurations:
- Weights: per-channel quantization
- Activations: per-tensor dynamic quantization
- Group-wise weight quantization: optional
- Bit-widths: W4A8 (4-bit per-channel weight, 8-bit per-tensor activation), W4A6, W3A8
All experiments were run on a single NVIDIA A100-40GB.
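The W4A8 configuration combines two granularities: one scale per output channel for weights, and a single scale for the whole activation tensor, recomputed per batch ("dynamic"). A minimal NumPy sketch of symmetric fake-quantization under these settings (function names are illustrative, not this repository's API):

```python
import numpy as np

def quantize_weight_per_channel(w, bits=4):
    """Symmetric per-channel quantization: one scale per output channel (row)."""
    qmax = 2 ** (bits - 1) - 1                       # 7 for 4-bit
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)         # guard all-zero rows
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q * scale                                 # dequantized ("fake-quant") weights

def quantize_act_per_tensor(x, bits=8):
    """Symmetric per-tensor dynamic quantization: scale taken from the current batch."""
    qmax = 2 ** (bits - 1) - 1                       # 127 for 8-bit
    scale = np.abs(x).max() / qmax
    q = np.clip(np.round(x / scale), -qmax - 1, qmax)
    return q * scale

rng = np.random.default_rng(0)
w = rng.normal(size=(8, 16))    # (out_channels, in_channels)
x = rng.normal(size=(4, 16))    # activation batch

w_q = quantize_weight_per_channel(w, bits=4)
x_q = quantize_act_per_tensor(x, bits=8)
print(np.abs(w - w_q).max(), np.abs(x - x_q).max())
```

With group quantization enabled, each weight row would instead be split into groups of `groupsize` input channels, each with its own scale; `groupsize -1` in the commands below corresponds to plain per-channel scales.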
Zero-shot accuracy:

- Full precision (FP16)

```sh
cd zero_shot
CUDA_VISIBLE_DEVICES=0 python main.py meta-llama/Llama-2-7b --calib_data c4 --tasks piqa,arc_easy,arc_challenge,boolq,copa,storycloze --table_results --method full
```
- AWRQ

```sh
# with smoothing
CUDA_VISIBLE_DEVICES=0 python main.py meta-llama/Llama-2-7b --calib_data c4 --tasks piqa,arc_easy,arc_challenge,boolq,copa,storycloze --table_results --wbits 4 --act_bits 8 --groupsize -1 --blocksize 1 --method awrq --smooth --alpha 0.50 --min 0.01
# without smoothing
CUDA_VISIBLE_DEVICES=0 python main.py meta-llama/Llama-2-7b --calib_data c4 --tasks piqa,arc_easy,arc_challenge,boolq,copa,storycloze --table_results --wbits 4 --act_bits 8 --groupsize -1 --blocksize 1 --method awrq
```
- SmoothQuant

```sh
CUDA_VISIBLE_DEVICES=0 python main.py meta-llama/Llama-2-7b --calib_data c4 --tasks piqa,arc_easy,arc_challenge,boolq,copa,storycloze --table_results --wbits 4 --act_bits 8 --groupsize -1 --method smoothquant --alpha 0.50 --min 0.01
```
- RTN

```sh
CUDA_VISIBLE_DEVICES=0 python main.py meta-llama/Llama-2-7b --calib_data c4 --tasks piqa,arc_easy,arc_challenge,boolq,copa,storycloze --table_results --wbits 4 --act_bits 8 --groupsize -1 --method rtn
```
- Weight only (GPTQ)

```sh
# with smoothing
CUDA_VISIBLE_DEVICES=0 python main.py meta-llama/Llama-2-7b --calib_data c4 --tasks piqa,arc_easy,arc_challenge,boolq,copa,storycloze --table_results --wbits 4 --groupsize -1 --blocksize 1 --method gptq --smooth --alpha 0.50 --min 0.01
# without smoothing
CUDA_VISIBLE_DEVICES=0 python main.py meta-llama/Llama-2-7b --calib_data c4 --tasks piqa,arc_easy,arc_challenge,boolq,copa,storycloze --table_results --wbits 4 --groupsize -1 --blocksize 1 --method gptq
```
- Activation only

```sh
# with smoothing
CUDA_VISIBLE_DEVICES=0 python main.py meta-llama/Llama-2-7b --calib_data c4 --tasks piqa,arc_easy,arc_challenge,boolq,copa,storycloze --table_results --act_bits 8 --method act_only --smooth --alpha 0.50 --min 0.01
# without smoothing
CUDA_VISIBLE_DEVICES=0 python main.py meta-llama/Llama-2-7b --calib_data c4 --tasks piqa,arc_easy,arc_challenge,boolq,copa,storycloze --table_results --act_bits 8 --method act_only
```
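The `--smooth --alpha 0.50 --min 0.01` flags suggest a SmoothQuant-style scale that migrates activation outliers into the weights: per input channel, s_j = max|X_j|^alpha / max|W_j|^(1-alpha), with a floor (presumably the `--min` flag). A sketch under that assumption; the repository's exact behavior may differ:

```python
import numpy as np

def smoothing_scales(act_absmax, w_absmax, alpha=0.5, min_scale=0.01):
    """Per-input-channel smoothing scale; dividing X by s and multiplying W by s
    leaves the product X @ W unchanged while shrinking activation outliers."""
    s = act_absmax ** alpha / (w_absmax ** (1.0 - alpha))
    return np.maximum(s, min_scale)   # floor the scales (assumed role of --min)

rng = np.random.default_rng(1)
X = rng.normal(size=(32, 16)) * np.array([10.0] * 4 + [0.1] * 12)  # 4 outlier channels
W = rng.normal(size=(16, 8))

s = smoothing_scales(np.abs(X).max(axis=0), np.abs(W).max(axis=1),
                     alpha=0.5, min_scale=0.01)
X_s, W_s = X / s, W * s[:, None]      # equivalent model, flatter activations
print(np.allclose(X @ W, X_s @ W_s))
```

After smoothing, per-tensor activation quantization wastes far less range on a few outlier channels, which is why the smoothed variants of each command exist.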
Perplexity (PPL):

- Full precision (FP16)

```sh
cd ppl
CUDA_VISIBLE_DEVICES=0 python opt.py facebook/opt-125m --calib_data c4 --method full
```
- AWRQ

```sh
# with smoothing
CUDA_VISIBLE_DEVICES=0 python opt.py facebook/opt-125m --calib_data c4 --wbits 4 --act_bits 8 --groupsize -1 --blocksize 1 --method awrq --smooth --alpha 0.50 --min 0.10
# without smoothing
CUDA_VISIBLE_DEVICES=0 python opt.py facebook/opt-125m --calib_data c4 --wbits 4 --act_bits 8 --groupsize -1 --blocksize 1 --method awrq
```
- SmoothQuant

```sh
CUDA_VISIBLE_DEVICES=0 python opt.py facebook/opt-125m --calib_data c4 --wbits 4 --act_bits 8 --groupsize -1 --method smoothquant --alpha 0.50 --min 0.10
```
- RTN

```sh
CUDA_VISIBLE_DEVICES=0 python opt.py facebook/opt-125m --calib_data c4 --wbits 4 --act_bits 8 --groupsize -1 --method rtn
```
- Weight only (GPTQ)

```sh
# with smoothing
CUDA_VISIBLE_DEVICES=0 python opt.py facebook/opt-125m --calib_data c4 --wbits 4 --groupsize -1 --blocksize 1 --method gptq --smooth --alpha 0.50 --min 0.10
# without smoothing
CUDA_VISIBLE_DEVICES=0 python opt.py facebook/opt-125m --calib_data c4 --wbits 4 --groupsize -1 --blocksize 1 --method gptq
```
- Activation only

```sh
# with smoothing
CUDA_VISIBLE_DEVICES=0 python opt.py facebook/opt-125m --calib_data c4 --act_bits 8 --method act_only --smooth --alpha 0.50 --min 0.10
# without smoothing
CUDA_VISIBLE_DEVICES=0 python opt.py facebook/opt-125m --calib_data c4 --act_bits 8 --method act_only
```
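The perplexities reported by the PPL scripts are the exponential of the mean per-token negative log-likelihood over the evaluation corpus. A toy sketch of the metric itself (not the repository's evaluation loop):

```python
import math

def perplexity(token_nlls):
    """PPL = exp(mean negative log-likelihood per token)."""
    return math.exp(sum(token_nlls) / len(token_nlls))

# a model assigning every token probability 1/50 has PPL exactly 50
nlls = [math.log(50)] * 1000
print(perplexity(nlls))
```

Lower is better; quantization methods are compared by how little the PPL rises over the FP16 baseline on WikiText2, PTB, and C4.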
Results:

Results of the LLaMA and LLaMA-2 families on zero-shot tasks at W4A8 (4-bit per-channel weight, 8-bit per-tensor activation) quantization.
References:
- GPTQ: Accurate Post-training Compression for Generative Pretrained Transformers
- SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models
- OmniQuant: Omnidirectionally Calibrated Quantization for Large Language Models