# 4-bit quantization for GPT-NeoX
```bash
$ python3 -m venv venv
$ source venv/bin/activate
$ pip install --upgrade pip
$ pip install -r requirements.txt
$ pip install transformers==4.28.0
$ pip install torch==2.0.0
$ pip install torchaudio==2.0.1
$ pip install safetensors==0.3.0
```
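With the environment in place, you can sanity-check the pinned versions and CUDA visibility. This small check is only illustrative and is not part of the repository:

```python
# Hypothetical environment check; not part of the repository.
import torch
import transformers

print("torch:", torch.__version__)
print("transformers:", transformers.__version__)
print("CUDA available:", torch.cuda.is_available())
```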
Quantize to 4-bit:

```bash
CUDA_VISIBLE_DEVICES=0 python neox.py EleutherAI/gpt-neox-20b wikitext2 --wbits 4 --act-order --true-sequential --groupsize 128 --save gpt-neox-20b-4bit-128g.pt
```
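The flags above store 4-bit weights with one scale and zero point per group of 128 input columns (`--groupsize 128`). As a rough illustration of that storage scheme only (not the GPTQ error-compensation updates themselves), a plain round-to-nearest group-wise quantizer could look like the sketch below; the function name and shapes are made up for the example:

```python
import torch

def quantize_groupwise_4bit(w: torch.Tensor, groupsize: int = 128):
    """Round-to-nearest 4-bit quantization per group of `groupsize` input columns.

    Mirrors the storage layout (int weights + per-group scale/zero point),
    not the error-compensating updates GPTQ applies on top of it.
    """
    out_features, in_features = w.shape
    w_groups = w.reshape(out_features, in_features // groupsize, groupsize)
    w_min = w_groups.min(dim=-1, keepdim=True).values
    w_max = w_groups.max(dim=-1, keepdim=True).values
    scale = (w_max - w_min).clamp(min=1e-8) / 15.0  # 4 bits -> 16 levels
    zero = torch.round(-w_min / scale)
    q = torch.clamp(torch.round(w_groups / scale + zero), 0, 15)
    dequant = (q - zero) * scale
    return q.reshape(out_features, in_features), dequant.reshape(out_features, in_features)

w = torch.randn(8, 256)
q, w_hat = quantize_groupwise_4bit(w, groupsize=128)
print("max abs error:", (w - w_hat).abs().max().item())
```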
Benchmark the fp16 baseline:

```bash
CUDA_VISIBLE_DEVICES=0 python neox.py EleutherAI/gpt-neox-20b wikitext2 --benchmark 2048 --check
```
Benchmark the 4-bit model:

```bash
CUDA_VISIBLE_DEVICES=0 python neox.py EleutherAI/gpt-neox-20b wikitext2 --wbits 4 --groupsize 128 --load gpt-neox-20b-4bit-128g.pt --benchmark 2048 --check
```
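`--benchmark 2048` times generation over a 2048-token sequence, and `--check` additionally verifies model quality, typically as perplexity on the calibration text. A simplified wikitext2 perplexity loop, independent of `neox.py` and possibly differing from its exact metric, looks roughly like this:

```python
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "EleutherAI/gpt-neox-20b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"  # device_map needs accelerate
)
model.eval()

text = "\n\n".join(load_dataset("wikitext", "wikitext-2-raw-v1", split="test")["text"])
input_ids = tokenizer(text, return_tensors="pt").input_ids

seqlen = 2048
nlls = []
for i in range(0, input_ids.shape[1] - seqlen, seqlen):
    batch = input_ids[:, i : i + seqlen].to(model.device)
    with torch.no_grad():
        # Labels equal to inputs -> HF shifts internally and returns mean cross-entropy.
        loss = model(batch, labels=batch).loss
    nlls.append(loss.float() * seqlen)

ppl = torch.exp(torch.stack(nlls).sum() / (len(nlls) * seqlen))
print(f"wikitext2 perplexity @ {seqlen}: {ppl.item():.2f}")
```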
Test inference (fp16 baseline):

```bash
CUDA_VISIBLE_DEVICES=0 python neox_inference.py EleutherAI/gpt-neox-20b --text "The capital of Japan is"
```
Test inference (4-bit):

```bash
CUDA_VISIBLE_DEVICES=0 python neox_inference.py EleutherAI/gpt-neox-20b --wbits 4 --groupsize 128 --load gpt-neox-20b-4bit-128g.pt --text "The capital of Japan is"
```
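For the fp16 baseline, the inference script is essentially tokenize-the-prompt-then-generate; a minimal equivalent using plain `transformers` is sketched below (the 4-bit path additionally swaps in the custom quantized linear layers, which is omitted here):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "EleutherAI/gpt-neox-20b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

prompt = "The capital of Japan is"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=32, do_sample=False)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```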
This code is a fork of GPTQ-for-LLaMA, which is based on GPTQ by IST-DASLab.
GPT-NeoX is hosted by EleutherAI.
The Triton GPTQ kernel code is based on GPTQ-triton.