Quantized Inference on Generative LLMs (QIGen)

A code generator for CPU inference on quantized Large Language Models. Quantization is performed with GPTQ.

Current features

  • Support for LLaMA and OPT
  • 4-, 3-, and 2-bit inference
  • x86 with AVX2 support
  • Support for PyTorch and transformers
  • Support for generic quantization group sizes

TODOs

  • Support for ARM Neon
  • Support for AVX512
  • Including quantization error analysis in code generation

Usage

Installation

  1. Install the dependencies: pip install -r requirements.txt
  2. Install transformers from source: pip install git+https://github.com/huggingface/transformers
  3. Install the Python module: python setup.py install. This runs a search to find the best parameters for register usage.
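
After installation, a quick import check confirms the compiled module is available. This is a minimal sketch; the module name infergen is taken from the usage example below.

```python
# Sanity check after installation: the module name `infergen` is taken
# from the usage example below.
import infergen

print("infergen installed at:", infergen.__file__)
```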

Example

An example notebook is provided in demo.ipynb. The basic workflow, sketched in the code after this list, is:

  • load the floating-point model,
  • load the quantized checkpoint produced by GPTQ,
  • call infergen.swap_modules_llama(model, quantized_checkpoint, bits=4, p=64, l1=l1, inplace=False), where model is the full-size model, quantized_checkpoint is the quantized model, bits is the number of bits used for quantization, l1 is the size of the L1 data cache in bits, p is the number of cores to use, and inplace flags whether to swap modules in place or to create a copy,
  • use the quantized model as a normal transformers model.
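
The snippet below is a minimal sketch of this workflow. The model name, the GPTQ checkpoint path, and the l1 value are placeholders; the swap_modules_llama call and its arguments are taken from the description above, and the assumption that inplace=False returns a modified copy is noted in the comments.

```python
# Minimal sketch of the workflow above. The model name and checkpoint path
# are placeholders; swap_modules_llama and its arguments follow the README.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

import infergen

model_name = "huggyllama/llama-7b"  # placeholder LLaMA checkpoint
model = AutoModelForCausalLM.from_pretrained(model_name)  # full-precision model
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Quantized checkpoint produced by GPTQ (hypothetical path).
quantized_checkpoint = torch.load("llama-7b-4bit-gptq.pt")

l1 = 32 * 1024 * 8  # L1 data cache size in bits (here: 32 KiB)

# Swap the linear layers for generated quantized kernels; with
# inplace=False a modified copy is returned (assumed).
qmodel = infergen.swap_modules_llama(
    model, quantized_checkpoint, bits=4, p=64, l1=l1, inplace=False
)

# Use the quantized model like any other transformers model.
inputs = tokenizer("Quantized inference is", return_tensors="pt")
output = qmodel.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```

Since the generated kernels target x86 CPUs with AVX2, no GPU is required for generation.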