GPTQ-NeoX

4-bit quantization of GPT-NeoX using GPTQ

Installation

$ python3 -m venv venv
$ source venv/bin/activate
$ pip install --upgrade pip
$ pip install -r requirements.txt
$ pip install transformers==4.28.0
$ pip install torch==2.0.0
$ pip install torchaudio==2.0.1
$ pip install safetensors==0.3.0
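
To verify the environment, an optional sanity check confirms that PyTorch can see the GPU:

$ python -c "import torch; print(torch.__version__, torch.cuda.is_available())"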

Usage

Quantize

Quantize to 4-bit, using wikitext2 as the calibration set:

CUDA_VISIBLE_DEVICES=0 python neox.py EleutherAI/gpt-neox-20b wikitext2 --wbits 4 --act-order --true-sequential --groupsize 128 --save gpt-neox-20b-4bit-128g.pt
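
Here --wbits sets the weight bit width, --groupsize 128 uses one quantization scale per 128 weight columns, and --act-order and --true-sequential are accuracy heuristics inherited from GPTQ-for-LLaMA. Following that project's convention, --save writes the quantized weights as a plain PyTorch state dict; a minimal sketch for inspecting the resulting checkpoint, assuming that convention holds:

import torch

# Load the checkpoint on CPU; it should be an ordinary state dict
# mapping parameter names to tensors (qweight, scales, zeros, etc.
# for each quantized linear layer).
state_dict = torch.load("gpt-neox-20b-4bit-128g.pt", map_location="cpu")

# Print the first few entries to confirm the quantized layout.
for name, tensor in list(state_dict.items())[:10]:
    print(name, tuple(tensor.shape), tensor.dtype)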

Benchmark

Benchmark the fp16 baseline:

CUDA_VISIBLE_DEVICES=0 python neox.py EleutherAI/gpt-neox-20b wikitext2 --benchmark 2048 --check

Benchmark the 4-bit model:

CUDA_VISIBLE_DEVICES=0 python neox.py EleutherAI/gpt-neox-20b wikitext2 --wbits 4 --groupsize 128 --load gpt-neox-20b-4bit-128g.pt --benchmark 2048 --check
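
--benchmark 2048 times token-by-token generation over 2048 tokens, and --check additionally reports perplexity so the fp16 and 4-bit runs can be compared for accuracy loss. Perplexity is the exponential of the mean per-token negative log-likelihood; a minimal sketch of the computation, with made-up example losses:

import math

def perplexity(nlls):
    # nlls: per-token negative log-likelihoods (natural log)
    return math.exp(sum(nlls) / len(nlls))

print(perplexity([2.31, 2.05, 2.48]))  # ~9.8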

Test Inference

Test inference with the fp16 baseline:

CUDA_VISIBLE_DEVICES=0 python neox_inference.py EleutherAI/gpt-neox-20b --text "The capital of Japan is"

Test inference with the 4-bit model:

CUDA_VISIBLE_DEVICES=0 python neox_inference.py EleutherAI/gpt-neox-20b --wbits 4 --groupsize 128 --load gpt-neox-20b-4bit-128g.pt --text "The capital of Japan is"
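
For reference, the fp16 baseline is roughly equivalent to plain transformers generation; a minimal sketch that assumes nothing about neox_inference.py beyond its command line:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "EleutherAI/gpt-neox-20b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
# fp16 weights for the 20B model need roughly 40 GB of GPU memory.
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16
).cuda()

inputs = tokenizer("The capital of Japan is", return_tensors="pt").to("cuda")
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(output[0], skip_special_tokens=True))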

Acknowledgements

This code is a fork of GPTQ-for-LLaMA, which is based on GPTQ by IST-DASLab.
GPT-NeoX is hosted by EleutherAI.
The Triton GPTQ kernel code is based on GPTQ-triton.