llm-awq-reproduction

This repository contains an end-to-end reproduction of the paper "AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration".

Introduction

Based on llm-awq, commit ca11f3.

We modified the following to make it work:

  • Added config.use_cache = False to avoid OOM.
  • Manually implemented perplexity (PPL) evaluation for WikiText (see the sketch after this list).
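
A minimal sketch of these two changes, assuming a standard transformers/datasets workflow (the helper name eval_wikitext_ppl and the wikitext-2-raw-v1 config are illustrative, not my_awq's actual API):

import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

def eval_wikitext_ppl(model_path: str, seqlen: int = 2048) -> float:
    tokenizer = AutoTokenizer.from_pretrained(model_path)
    model = AutoModelForCausalLM.from_pretrained(
        model_path, torch_dtype=torch.float16, device_map="auto"
    )
    model.config.use_cache = False  # change 1: disable the KV cache to avoid OOM
    model.eval()

    # change 2: manual perplexity over the WikiText test split in fixed-length chunks
    test = load_dataset("wikitext", "wikitext-2-raw-v1", split="test")
    ids = tokenizer("\n\n".join(test["text"]), return_tensors="pt").input_ids
    nlls = []
    for i in range(ids.shape[1] // seqlen):
        chunk = ids[:, i * seqlen : (i + 1) * seqlen].to(model.device)
        with torch.no_grad():
            loss = model(chunk, labels=chunk).loss  # mean token-level NLL for the chunk
        nlls.append(loss.float() * seqlen)
    return torch.exp(torch.stack(nlls).sum() / (len(nlls) * seqlen)).item()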

Why commit ca11f3:

  • It is the nearest major commit before 5f06db, which introduced a much more complex CUDA kernel.

Installation

To set up the environment, follow these steps:

  1. Clone the repository and the submodule:
git clone https://github.com/panjd123/llm-awq-reproduction.git
cd llm-awq-reproduction
git submodule update --init --recursive
  2. Install the original AWQ package:
cd llm-awq # this is the original implementation of AWQ
conda create -n awq python=3.10 -y
conda activate awq
pip install --upgrade pip  # enable PEP 660 support
pip install -e .
  3. Install the efficient W4A16 (4-bit weight, 16-bit activation) CUDA kernel and the optimized FP16 kernels (e.g., LayerNorm, positional encodings):
cd awq/kernels
python setup.py install
  4. Install our reproduction package:
cd ../../../ # go back to the root directory of this repository
cd my-awq
pip install -e .
  5. Install our W4A16 CUDA kernel (AWQ's official kernel is used by default):
cd my_awq/kernels
python setup.py install
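
After these steps, a quick sanity check (assuming the installed packages expose the awq and my_awq modules, as the commands below do):

import torch
import awq     # original llm-awq package (step 2)
import my_awq  # our reproduction package (step 4)
assert torch.cuda.is_available(), "the W4A16 kernels require a CUDA-capable GPU"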

To use our custom kernel, add --kernels custom to the command line.

Reproduce the Results on Llama-2-7b

If my_awq is not installed, make sure you are in the my-awq directory (llm-awq-reproduction/my-awq) before running the following commands.

If you can access huggingface.co:

bash ../scripts/online.sh

mkdir -p /data/my_awq_cache
mkdir -p /data/my_quant_cache

python -m my_awq.entry --model_path meta-llama/Llama-2-7b \
    --w_bit 4 --q_group_size 128 \
    --run_awq --calib_dataset_path mit-han-lab/pile-val-backup --dump_awq /data/my_awq_cache/llama-2-7b-w4-g128.pt

python -m my_awq.entry --model_path meta-llama/Llama-2-7b \
    --w_bit 4 --q_group_size 128 \
    --load_awq /data/my_awq_cache/llama-2-7b-w4-g128.pt --q_backend fake \
    --run_eval --wikitext_path wikitext

python -m my_awq.entry --model_path meta-llama/Llama-2-7b \
    --w_bit 4 --q_group_size 128 \
    --load_awq /data/my_awq_cache/llama-2-7b-w4-g128.pt --q_backend real \
    --dump_quant /data/my_quant_cache/llama-2-7b-w4-g128.pt

python -m my_awq.entry --model_path meta-llama/Llama-2-7b \
    --w_bit 4 --q_group_size 128 \
    --load_quant /data/my_quant_cache/llama-2-7b-w4-g128.pt --q_backend real \
    --run_eval --wikitext_path wikitext
    
python -m my_awq.entry --model_path meta-llama/Llama-2-7b \
    --run_eval --wikitext_path wikitext
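
Here --w_bit 4 --q_group_size 128 quantizes each weight matrix in groups of 128 consecutive values, each group with its own scale and zero-point; --q_backend fake keeps the weights in FP16 but rounds them onto the 4-bit grid (for accuracy evaluation), while real packs them into INT4 and runs the CUDA kernel. A minimal sketch of such group-wise asymmetric min-max fake quantization (illustrative, not the exact code in my_awq):

import torch

def pseudo_quantize(w: torch.Tensor, n_bit: int = 4, group_size: int = 128) -> torch.Tensor:
    # Assumes the number of weight elements is divisible by group_size.
    orig_shape = w.shape
    w = w.reshape(-1, group_size)            # one scale/zero-point per group of 128
    qmax = 2 ** n_bit - 1
    w_min = w.amin(dim=1, keepdim=True)
    w_max = w.amax(dim=1, keepdim=True)
    scale = (w_max - w_min).clamp(min=1e-5) / qmax
    zero = (-w_min / scale).round()
    # Quantize then immediately dequantize: values stay in floating point but can
    # only take 2**n_bit distinct levels per group ("fake"/simulated quantization).
    w_q = (torch.clamp((w / scale).round() + zero, 0, qmax) - zero) * scale
    return w_q.reshape(orig_shape)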

Otherwise:

Download the models and datasets to the following paths:

bash ../scripts/offline.sh

mkdir -p /data/my_awq_cache
mkdir -p /data/my_quant_cache

python -m my_awq.entry --model_path /data/models/llama-2-7b \
    --w_bit 4 --q_group_size 128 \
    --run_awq --calib_dataset_path /data/datasets/pile-val-backup --dump_awq /data/my_awq_cache/llama-2-7b-w4-g128.pt

python -m my_awq.entry --model_path /data/models/llama-2-7b \
    --w_bit 4 --q_group_size 128 \
    --load_awq /data/my_awq_cache/llama-2-7b-w4-g128.pt --q_backend fake \
    --run_eval --wikitext_path /data/datasets/wikitext

python -m my_awq.entry --model_path /data/models/llama-2-7b \
    --w_bit 4 --q_group_size 128 \
    --load_awq /data/my_awq_cache/llama-2-7b-w4-g128.pt --q_backend real \
    --dump_quant /data/my_quant_cache/llama-2-7b-w4-g128.pt

python -m my_awq.entry --model_path /data/models/llama-2-7b \
    --w_bit 4 --q_group_size 128 \
    --load_quant /data/my_quant_cache/llama-2-7b-w4-g128.pt --q_backend real \
    --run_eval --wikitext_path /data/datasets/wikitext
    
python -m my_awq.entry --model_path /data/models/llama-2-7b \
    --run_eval --wikitext_path /data/datasets/wikitext
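
One way to stage these assets ahead of time is huggingface_hub.snapshot_download (a sketch only; ../scripts/offline.sh may fetch them differently, and meta-llama/Llama-2-7b is a gated repo that requires an access token):

from huggingface_hub import snapshot_download

# Model weights (gated: requires accepting Meta's license and an authenticated token).
snapshot_download("meta-llama/Llama-2-7b", local_dir="/data/models/llama-2-7b")

# Calibration set for --run_awq and the evaluation dataset for --run_eval.
snapshot_download("mit-han-lab/pile-val-backup", repo_type="dataset",
                  local_dir="/data/datasets/pile-val-backup")
snapshot_download("wikitext", repo_type="dataset",
                  local_dir="/data/datasets/wikitext")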

Output

[Output image]