NolanoOrg/cformers

Upload GPTQ Quantized models in 4-bit precision format for different bin-sizes to Huggingface

Ayushk4 opened this issue · 2 comments

  • GPTJ
  • BLOOM
  • GPT-NeoX
  • Pythia Series
  • Open-Assistant
  • Codegen
  • OPT & Galactica
    ...

Per "Int-4 LLaMa is not enough - Int-3 and beyond", binning with a bin size of 128 appears to remove most of the remaining output-quality loss of GPTQ for models larger than ~10B, while only negligibly affecting the memory requirement.
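To make the "negligible" overhead concrete: with binning (group-wise quantization), each group of 128 weights carries its own scale and zero-point, which adds only a fraction of a bit per weight. The sketch below is a simple round-to-nearest illustration of that storage layout (real GPTQ additionally does error compensation during quantization); the function names are illustrative, not part of cformers.

```python
# Minimal sketch of group-wise ("binned") round-to-nearest quantization.
# Real GPTQ additionally does Hessian-based error compensation, but the
# per-group storage layout (int codes + fp16 scale/zero) is the same.
import numpy as np

def quantize_grouped(w, bits=3, group_size=128):
    """Quantize a 1-D weight array in groups, each with its own scale/zero."""
    levels = 2 ** bits - 1
    g = w.reshape(-1, group_size)                     # one row per group
    g_min = g.min(axis=1, keepdims=True)
    g_max = g.max(axis=1, keepdims=True)
    scale = np.maximum((g_max - g_min) / levels, 1e-8)
    q = np.rint((g - g_min) / scale).astype(np.uint8)
    return q, scale.astype(np.float16), g_min.astype(np.float16)

def effective_bits(bits=3, group_size=128, meta_bits=32):
    """Bits per weight including the per-group fp16 scale + zero overhead."""
    return bits + meta_bits / group_size

print(effective_bits(3, 128))   # 3.25 bits/weight, vs. 4.0 for plain int4
print(effective_bits(4, 128))   # 4.25 bits/weight
```

With a group size of 128 and fp16 scale/zero-point, 3-bit weights cost roughly 3.25 bits each, which is the small overhead referred to above.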

GPTQ-for-LLaMa, one of the first GPTQ projects, is already moving towards 3-bit, with binning as the new default.

Given that memory bandwidth is the major bottleneck on CPU, fewer bits means faster inference. For models that are large enough (~10B+), 3-bit GPTQ with binning may be the way to go.
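A rough illustration of the bandwidth argument: each generated token streams the full set of weights through memory once, so attainable tokens/s is bounded by bandwidth divided by the weight size in bytes. The numbers below (13B parameters, 50 GB/s of memory bandwidth) are illustrative assumptions, not measurements.

```python
# Back-of-envelope bound on CPU decode speed: every generated token streams
# all weights through memory once, so tokens/s <= bandwidth / weight_bytes.
# The 13B model size and 50 GB/s bandwidth are illustrative assumptions.
def max_tokens_per_sec(params_billion, bits_per_weight, bandwidth_gb_s=50.0):
    weight_gb = params_billion * bits_per_weight / 8.0
    return bandwidth_gb_s / weight_gb

for bits in (16.0, 4.25, 3.25):      # fp16, int4 + binning, int3 + binning
    print(f"{bits:5.2f} bits/weight -> <= {max_tokens_per_sec(13, bits):.1f} tok/s")
```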

Thanks for the suggestion @MarkSchmidty. I am opening a separate issue (#12) as this will require new C/CPP kernels to be added as well.

In short, the "Int-4 LLaMa is not enough" study assumed that only the weights were being quantized, not the intermediate representations. We need to either add new kernels or run another study of the performance drop.
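For clarity on what "only weights quantized" means at the kernel level: activations stay in fp32/fp16 and the weight codes are dequantized group-by-group inside the matmul. The NumPy sketch below only mirrors what a new int3 kernel would need to do; shapes and names are illustrative and not the cformers C API.

```python
# Sketch of a weight-only quantized matvec: activations stay in fp32 and the
# int3 weight codes are dequantized group-by-group on the fly. Shapes and
# names are illustrative; this is not the actual cformers C kernel.
import numpy as np

def matvec_weight_only(q, scale, zero, x):
    """y = W @ x with W stored as grouped integer codes plus scale/zero.

    q:     (out_features, n_groups, group_size) uint8 codes
    scale: (out_features, n_groups, 1) per-group scales
    zero:  (out_features, n_groups, 1) per-group zero-points
    x:     (in_features,) fp32 activations -- left unquantized
    """
    w = q.astype(np.float32) * scale + zero          # dequantize on the fly
    return w.reshape(q.shape[0], -1) @ x             # accumulate in fp32
```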