Dependencies:
- torch: tested on 1.13.1+cu117
- transformers: tested on 4.34.0
Models:
- LLaMA, LLaMA-2
- OPT
Datasets:
- Calibration: C4
- Evaluation:
  - Task accuracy: PIQA, ARC-e, ARC-c, BoolQ, COPA, StoryCloze
  - PPL: WikiText2, PTB, C4
Quantization configurations:
- Weights: per-channel quantization
- Activations: per-tensor dynamic quantization
- Group-wise weight quantization: optional
- Bit-widths: W4A8 (4-bit per-channel weight, 8-bit per-tensor activation), W4A6, W3A8
All experiments were run on a single NVIDIA A100-40GB.
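The W4A8 configuration combines two granularities: one scale per output channel for weights, and a single scale for the whole activation tensor, recomputed per batch ("dynamic"). A minimal NumPy sketch of symmetric fake-quantization under these settings (function names are illustrative, not this repository's API):

```python
import numpy as np

def quantize_weight_per_channel(w, bits=4):
    """Symmetric per-channel quantization: one scale per output channel (row)."""
    qmax = 2 ** (bits - 1) - 1                       # 7 for 4-bit
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)         # guard all-zero rows
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q * scale                                 # dequantized ("fake-quant") weights

def quantize_act_per_tensor(x, bits=8):
    """Symmetric per-tensor dynamic quantization: scale taken from the current batch."""
    qmax = 2 ** (bits - 1) - 1                       # 127 for 8-bit
    scale = np.abs(x).max() / qmax
    q = np.clip(np.round(x / scale), -qmax - 1, qmax)
    return q * scale

rng = np.random.default_rng(0)
w = rng.normal(size=(8, 16))    # (out_channels, in_channels)
x = rng.normal(size=(4, 16))    # activation batch

w_q = quantize_weight_per_channel(w, bits=4)
x_q = quantize_act_per_tensor(x, bits=8)
print(np.abs(w - w_q).max(), np.abs(x - x_q).max())
```

With group quantization enabled, each weight row would instead be split into groups of `groupsize` input channels, each with its own scale; `groupsize -1` in the commands below corresponds to plain per-channel scales.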
Zero-shot accuracy:

- Full precision (FP16)

```sh
cd zero_shot
CUDA_VISIBLE_DEVICES=0 python main.py meta-llama/Llama-2-7b --calib_data c4 --tasks piqa,arc_easy,arc_challenge,boolq,copa,storycloze --table_results --method full
```
- AWRQ

```sh
# with smoothing
CUDA_VISIBLE_DEVICES=0 python main.py meta-llama/Llama-2-7b --calib_data c4 --tasks piqa,arc_easy,arc_challenge,boolq,copa,storycloze --table_results --wbits 4 --act_bits 8 --groupsize -1 --blocksize 1 --method awrq --smooth --alpha 0.50 --min 0.01
# without smoothing
CUDA_VISIBLE_DEVICES=0 python main.py meta-llama/Llama-2-7b --calib_data c4 --tasks piqa,arc_easy,arc_challenge,boolq,copa,storycloze --table_results --wbits 4 --act_bits 8 --groupsize -1 --blocksize 1 --method awrq
```
- SmoothQuant

```sh
CUDA_VISIBLE_DEVICES=0 python main.py meta-llama/Llama-2-7b --calib_data c4 --tasks piqa,arc_easy,arc_challenge,boolq,copa,storycloze --table_results --wbits 4 --act_bits 8 --groupsize -1 --method smoothquant --alpha 0.50 --min 0.01
```
- RTN

```sh
CUDA_VISIBLE_DEVICES=0 python main.py meta-llama/Llama-2-7b --calib_data c4 --tasks piqa,arc_easy,arc_challenge,boolq,copa,storycloze --table_results --wbits 4 --act_bits 8 --groupsize -1 --method rtn
```
- Weight only (GPTQ)

```sh
# with smoothing
CUDA_VISIBLE_DEVICES=0 python main.py meta-llama/Llama-2-7b --calib_data c4 --tasks piqa,arc_easy,arc_challenge,boolq,copa,storycloze --table_results --wbits 4 --groupsize -1 --blocksize 1 --method gptq --smooth --alpha 0.50 --min 0.01
# without smoothing
CUDA_VISIBLE_DEVICES=0 python main.py meta-llama/Llama-2-7b --calib_data c4 --tasks piqa,arc_easy,arc_challenge,boolq,copa,storycloze --table_results --wbits 4 --groupsize -1 --blocksize 1 --method gptq
```
- Activation only

```sh
# with smoothing
CUDA_VISIBLE_DEVICES=0 python main.py meta-llama/Llama-2-7b --calib_data c4 --tasks piqa,arc_easy,arc_challenge,boolq,copa,storycloze --table_results --act_bits 8 --method act_only --smooth --alpha 0.50 --min 0.01
# without smoothing
CUDA_VISIBLE_DEVICES=0 python main.py meta-llama/Llama-2-7b --calib_data c4 --tasks piqa,arc_easy,arc_challenge,boolq,copa,storycloze --table_results --act_bits 8 --method act_only
```
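The `--smooth --alpha 0.50 --min 0.01` flags suggest a SmoothQuant-style scale that migrates activation outliers into the weights: per input channel, s_j = max|X_j|^alpha / max|W_j|^(1-alpha), with a floor (presumably the `--min` flag). A sketch under that assumption; the repository's exact behavior may differ:

```python
import numpy as np

def smoothing_scales(act_absmax, w_absmax, alpha=0.5, min_scale=0.01):
    """Per-input-channel smoothing scale; dividing X by s and multiplying W by s
    leaves the product X @ W unchanged while shrinking activation outliers."""
    s = act_absmax ** alpha / (w_absmax ** (1.0 - alpha))
    return np.maximum(s, min_scale)   # floor the scales (assumed role of --min)

rng = np.random.default_rng(1)
X = rng.normal(size=(32, 16)) * np.array([10.0] * 4 + [0.1] * 12)  # 4 outlier channels
W = rng.normal(size=(16, 8))

s = smoothing_scales(np.abs(X).max(axis=0), np.abs(W).max(axis=1),
                     alpha=0.5, min_scale=0.01)
X_s, W_s = X / s, W * s[:, None]      # equivalent model, flatter activations
print(np.allclose(X @ W, X_s @ W_s))
```

After smoothing, per-tensor activation quantization wastes far less range on a few outlier channels, which is why the smoothed variants of each command exist.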
Perplexity (PPL):

- Full precision (FP16)

```sh
cd ppl
CUDA_VISIBLE_DEVICES=0 python opt.py facebook/opt-125m --calib_data c4 --method full
```
- AWRQ

```sh
# with smoothing
CUDA_VISIBLE_DEVICES=0 python opt.py facebook/opt-125m --calib_data c4 --wbits 4 --act_bits 8 --groupsize -1 --blocksize 1 --method awrq --smooth --alpha 0.50 --min 0.10
# without smoothing
CUDA_VISIBLE_DEVICES=0 python opt.py facebook/opt-125m --calib_data c4 --wbits 4 --act_bits 8 --groupsize -1 --blocksize 1 --method awrq
```
- SmoothQuant

```sh
CUDA_VISIBLE_DEVICES=0 python opt.py facebook/opt-125m --calib_data c4 --wbits 4 --act_bits 8 --groupsize -1 --method smoothquant --alpha 0.50 --min 0.10
```
- RTN

```sh
CUDA_VISIBLE_DEVICES=0 python opt.py facebook/opt-125m --calib_data c4 --wbits 4 --act_bits 8 --groupsize -1 --method rtn
```
- Weight only (GPTQ)

```sh
# with smoothing
CUDA_VISIBLE_DEVICES=0 python opt.py facebook/opt-125m --calib_data c4 --wbits 4 --groupsize -1 --blocksize 1 --method gptq --smooth --alpha 0.50 --min 0.10
# without smoothing
CUDA_VISIBLE_DEVICES=0 python opt.py facebook/opt-125m --calib_data c4 --wbits 4 --groupsize -1 --blocksize 1 --method gptq
```
- Activation only

```sh
# with smoothing
CUDA_VISIBLE_DEVICES=0 python opt.py facebook/opt-125m --calib_data c4 --act_bits 8 --method act_only --smooth --alpha 0.50 --min 0.10
# without smoothing
CUDA_VISIBLE_DEVICES=0 python opt.py facebook/opt-125m --calib_data c4 --act_bits 8 --method act_only
```
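The perplexities reported by the PPL scripts are the exponential of the mean per-token negative log-likelihood over the evaluation corpus. A toy sketch of the metric itself (not the repository's evaluation loop):

```python
import math

def perplexity(token_nlls):
    """PPL = exp(mean negative log-likelihood per token)."""
    return math.exp(sum(token_nlls) / len(token_nlls))

# a model assigning every token probability 1/50 has PPL exactly 50
nlls = [math.log(50)] * 1000
print(perplexity(nlls))
```

Lower is better; quantization methods are compared by how little the PPL rises over the FP16 baseline on WikiText2, PTB, and C4.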
Results:

Results of the LLaMA and LLaMA-2 families on zero-shot tasks at W4A8 (4-bit per-channel weight, 8-bit per-tensor activation) quantization.
References:
- GPTQ: Accurate Post-training Compression for Generative Pretrained Transformers
- SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models
- OmniQuant: Omnidirectionally Calibrated Quantization for Large Language Models