Primary language: Python. License: Apache License 2.0 (Apache-2.0).

GPTQ-for-LLaMA

4-bit quantization of LLaMA using GPTQ

GPTQ is a state-of-the-art (SOTA) one-shot weight quantization method.

This code is based on GPTQ

This version has been created and tested for use with KoboldAI

New Features

  • Optimized CPU Offloading
  • Optimized GPU Splitting
  • Backwards Compatibility with older GPTQ-models

Currently, groupsize and act-order cannot be used together; you must choose one of them.

Result

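LLaMA-7B

| LLaMA-7B | Bits | Group size | Memory (MiB) | Wikitext2 perplexity | Checkpoint size (GB) |
|---|---|---|---|---|---|
| FP16 | 16 | - | 13940 | 5.68 | 12.5 |
| RTN | 4 | - | - | 6.29 | - |
| GPTQ | 4 | - | 4740 | 6.09 | 3.5 |
| RTN | 3 | - | - | 25.54 | - |
| GPTQ | 3 | - | 3852 | 8.07 | 2.7 |
| GPTQ | 3 | 128 | 4116 | 6.61 | 3.0 |
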
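LLaMA-13B

| LLaMA-13B | Bits | Group size | Memory (MiB) | Wikitext2 perplexity | Checkpoint size (GB) |
|---|---|---|---|---|---|
| FP16 | 16 | - | OOM | 5.09 | 24.2 |
| RTN | 4 | - | - | 5.53 | - |
| GPTQ | 4 | - | 8410 | 5.36 | 6.5 |
| RTN | 3 | - | - | 11.40 | - |
| GPTQ | 3 | - | 6870 | 6.63 | 5.1 |
| GPTQ | 3 | 128 | 7277 | 5.62 | 5.4 |
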
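LLaMA-33B

| LLaMA-33B | Bits | Group size | Memory (MiB) | Wikitext2 perplexity | Checkpoint size (GB) |
|---|---|---|---|---|---|
| FP16 | 16 | - | OOM | 4.10 | 60.5 |
| RTN | 4 | - | - | 4.54 | - |
| GPTQ | 4 | - | 19493 | 4.45 | 15.7 |
| RTN | 3 | - | - | 14.89 | - |
| GPTQ | 3 | - | 15493 | 5.69 | 12.0 |
| GPTQ | 3 | 128 | 16566 | 4.80 | 13.0 |
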
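LLaMA-65B

| LLaMA-65B | Bits | Group size | Memory (MiB) | Wikitext2 perplexity | Checkpoint size (GB) |
|---|---|---|---|---|---|
| FP16 | 16 | - | OOM | 3.53 | 121.0 |
| RTN | 4 | - | - | 3.92 | - |
| GPTQ | 4 | - | OOM | 3.84 | 31.1 |
| RTN | 3 | - | - | 10.59 | - |
| GPTQ | 3 | - | OOM | 5.04 | 23.6 |
| GPTQ | 3 | 128 | OOM | 4.17 | 25.6 |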
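
The result tables can be summarized with a little arithmetic. The sketch below copies the FP16 and 4-bit GPTQ rows (no group size) from the tables and computes, per model, the perplexity increase and the checkpoint size reduction. The names `RESULTS` and `summarize` are illustrative only and are not part of this repository.

```python
# Figures copied from the result tables (FP16 row and 4-bit GPTQ row, no group size).
# Each entry: (FP16 perplexity, GPTQ perplexity, FP16 checkpoint GB, GPTQ checkpoint GB)
RESULTS = {
    "LLaMA-7B": (5.68, 6.09, 12.5, 3.5),
    "LLaMA-13B": (5.09, 5.36, 24.2, 6.5),
    "LLaMA-33B": (4.10, 4.45, 60.5, 15.7),
    "LLaMA-65B": (3.53, 3.84, 121.0, 31.1),
}


def summarize(results):
    """Return {model: (perplexity_increase, size_ratio)}, rounded to 2 decimals."""
    summary = {}
    for model, (ppl_fp16, ppl_q, size_fp16, size_q) in results.items():
        perplexity_increase = round(ppl_q - ppl_fp16, 2)
        size_ratio = round(size_fp16 / size_q, 2)
        summary[model] = (perplexity_increase, size_ratio)
    return summary


if __name__ == "__main__":
    for model, (gap, ratio) in summarize(RESULTS).items():
        print(f"{model}: perplexity +{gap:.2f}, checkpoint {ratio:.2f}x smaller")
```

The size ratios come out somewhat below the 4x that 16 bits versus 4 bits would suggest. A plausible reason, stated here as an assumption rather than something the tables show, is that a checkpoint also stores data that is not quantized to 4 bits.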

Quantization requires a large amount of CPU memory (RAM). If the machine does not have enough RAM, swap space can be used to cover the shortfall.

Depending on the GPU and drivers, measured performance may differ; this difference decreases as the model size increases (IST-DASLab/gptq#1).

According to the GPTQ paper, the difference in performance between FP16 and GPTQ decreases as the size of the model increases.

Installation

pip install git+https://github.com/0cc4m/GPTQ-for-LLaMa@c884b421a233f9603d8224c9b22c2d83dd2c1fc4

Old instructions:

If you don't have conda, install it first.

  conda create --name gptq python=3.9 -y
  conda activate gptq
  conda install pytorch torchvision torchaudio pytorch-cuda=11.7 -c pytorch -c nvidia
  # Or, if you're having trouble with conda, use pip with python3.9:
  # pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu117

  git clone https://github.com/0cc4m/GPTQ-for-LLaMa
  cd GPTQ-for-LLaMa
  pip install -r requirements.txt
  python setup_cuda.py install

  # Benchmark performance for FC2 layer of LLaMa-7B
  CUDA_VISIBLE_DEVICES=0 python test_kernel.py

Dependencies

All experiments were run on a single NVIDIA RTX3090.

Language Generation

LLaMA

The way to invoke this version of GPTQ has changed: instead of specifying Python files, you now specify the module name.

| Old command | New command |
|---|---|
| python llama.py | python -m gptq.llama |
| python gptj.py | python -m gptq.gptj |
| python opt.py | python -m gptq.opt |
| python gptneox.py | python -m gptq.gptneox |
| python llama_inference.py | python -m gptq.llama_inference |
| python llama_inference_offload.py | python -m gptq.llama_inference_offload |
| python convert_llama_weights_to_hf.py | python -m gptq.convert_llama_weights_to_hf |

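
For existing scripts that still use the old invocations, the mapping above can be applied mechanically. The following is a minimal sketch of that idea; `SCRIPT_TO_MODULE` and `migrate_command` are hypothetical helper names, not something this repository provides.

```python
import shlex

# Old script names from the table above, mapped to their module names.
SCRIPT_TO_MODULE = {
    "llama.py": "gptq.llama",
    "gptj.py": "gptq.gptj",
    "opt.py": "gptq.opt",
    "gptneox.py": "gptq.gptneox",
    "llama_inference.py": "gptq.llama_inference",
    "llama_inference_offload.py": "gptq.llama_inference_offload",
    "convert_llama_weights_to_hf.py": "gptq.convert_llama_weights_to_hf",
}


def migrate_command(command: str) -> str:
    """Rewrite 'python <script>.py args' as 'python -m <module> args'.

    Commands that do not invoke a known script are returned unchanged.
    """
    tokens = shlex.split(command)
    for i, token in enumerate(tokens[:-1]):
        script = tokens[i + 1]
        if token.startswith("python") and script in SCRIPT_TO_MODULE:
            # Replace the script name with '-m <module>' and stop.
            tokens[i + 1 : i + 2] = ["-m", SCRIPT_TO_MODULE[script]]
            return shlex.join(tokens)
    return command


if __name__ == "__main__":
    print(migrate_command("CUDA_VISIBLE_DEVICES=0 python llama.py ./llama-hf/llama-7b c4 --wbits 4"))
```

Note that `shlex.join` re-quotes arguments, so an argument containing spaces comes back in single quotes.
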
# Convert LLaMA weights to the Hugging Face (hf) format
python -m gptq.convert_llama_weights_to_hf --input_dir /path/to/downloaded/llama/weights --model_size 7B --output_dir ./llama-hf

# Benchmark language generation with 4-bit LLaMA-7B:

# Save compressed model
CUDA_VISIBLE_DEVICES=0 python -m gptq.llama ./llama-hf/llama-7b c4 --wbits 4 --true-sequential --act-order --save llama7b-4bit.pt
# Or save compressed `.safetensors` model
CUDA_VISIBLE_DEVICES=0 python -m gptq.llama ./llama-hf/llama-7b c4 --wbits 4 --true-sequential --act-order --save_safetensors llama7b-4bit.safetensors
# Benchmark generating a 2048 token sequence with the saved model
CUDA_VISIBLE_DEVICES=0 python -m gptq.llama ./llama-hf/llama-7b c4 --wbits 4 --load llama7b-4bit.pt --benchmark 2048 --check
# Benchmark FP16 baseline, note that the model will be split across all listed GPUs
CUDA_VISIBLE_DEVICES=0,1,2,3,4 python -m gptq.llama ./llama-hf/llama-7b c4 --benchmark 2048 --check

# Model inference with the saved model
CUDA_VISIBLE_DEVICES=0 python -m gptq.llama_inference ./llama-hf/llama-7b --wbits 4 --load llama7b-4bit.pt --text "this is llama"
# Model inference with the saved model, with offloading. This is very slow: it is a simple implementation and could be improved with techniques like FlexGen (https://github.com/FMInference/FlexGen).
CUDA_VISIBLE_DEVICES=0 python -m gptq.llama_inference_offload ./llama-hf/llama-7b --wbits 4 --load llama7b-4bit.pt --text "this is llama" --pre_layer 16
With LLaMA-65B on a single RTX 3090 and pre_layer set to 50, it takes about 180 seconds to generate 45 tokens (growing the sequence from 5 to 50 tokens), or roughly 4 seconds per token.

The CUDA kernels support 2, 3, 4, and 8 bits; the faster CUDA kernels support 2, 3, and 4 bits.
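
To illustrate what a bit width means for storage, the sketch below packs low-bit integer codes into 32-bit words and unpacks them again. It is a simplified illustration, not this repository's packing code: the function names and the lowest-bits-first layout are assumptions, and only widths that divide 32 evenly (2, 4, 8) are handled. A 3-bit layout has to let values straddle word boundaries and is left out here.

```python
def pack_words(values, bits):
    """Pack unsigned integer codes into 32-bit words, lowest bits first.

    Assumed layout for illustration only; the real kernels may differ.
    """
    if bits not in (2, 4, 8):
        raise ValueError("this sketch only handles widths that divide 32 evenly")
    per_word = 32 // bits
    if len(values) % per_word != 0:
        raise ValueError("number of values must be a multiple of 32 // bits")
    words = []
    for start in range(0, len(values), per_word):
        word = 0
        for offset, value in enumerate(values[start:start + per_word]):
            if not 0 <= value < (1 << bits):
                raise ValueError("value does not fit in the given bit width")
            word |= value << (bits * offset)
        words.append(word)
    return words


def unpack_words(words, bits):
    """Inverse of pack_words: recover the codes from the 32-bit words."""
    per_word = 32 // bits
    mask = (1 << bits) - 1
    return [(word >> (bits * offset)) & mask
            for word in words
            for offset in range(per_word)]


if __name__ == "__main__":
    codes = [1, 2, 3, 4, 5, 6, 7, 8]
    packed = pack_words(codes, bits=4)
    print([hex(w) for w in packed], unpack_words(packed, bits=4))
```

With 4 bits, eight codes fit into one 32-bit word, which is where the storage saving over 16-bit weights comes from.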

In general, 4-bit quantization with a groupsize of 128 is recommended.
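
To give an intuition for what the group size controls, here is a minimal sketch of group-wise round-to-nearest (RTN) quantization, the baseline listed in the result tables. It is not GPTQ itself, which additionally compensates for quantization error, and it is not this repository's quantizer. All function names are hypothetical, and the asymmetric min/max scheme is an assumption made for illustration.

```python
def quantize_group(weights, bits=4):
    """Round-to-nearest quantization of one group of floats.

    Returns (codes, scale, lo) with integer codes in [0, 2**bits - 1].
    """
    maxq = (1 << bits) - 1
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / maxq or 1.0  # fall back to 1.0 for constant groups
    codes = [min(maxq, max(0, round((w - lo) / scale))) for w in weights]
    return codes, scale, lo


def dequantize_group(codes, scale, lo):
    """Map integer codes back to approximate float values."""
    return [lo + c * scale for c in codes]


def quantize_row(row, bits=4, groupsize=128):
    """Quantize and dequantize a row in consecutive groups.

    Each group gets its own scale and offset, so an outlier only
    degrades the precision of the group it belongs to.
    """
    approx = []
    for start in range(0, len(row), groupsize):
        codes, scale, lo = quantize_group(row[start:start + groupsize], bits)
        approx.extend(dequantize_group(codes, scale, lo))
    return approx


def max_error(row, approx):
    """Largest absolute difference between original and approximation."""
    return max(abs(a - b) for a, b in zip(row, approx))


if __name__ == "__main__":
    row = [0.01 * i for i in range(256)]
    row[255] = 50.0  # a single outlier in the second half of the row
    whole = quantize_row(row, groupsize=256)
    grouped = quantize_row(row, groupsize=128)
    print("error in first 128 weights, one group :", max_error(row[:128], whole[:128]))
    print("error in first 128 weights, groups of 128:", max_error(row[:128], grouped[:128]))
```

Smaller groups cost extra storage for the per-group scale and offset, which matches the slightly larger checkpoints in the group-size 128 rows of the result tables.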

Acknowledgements

This code is based on GPTQ

Thanks to Meta AI for releasing LLaMA, a powerful LLM.