qwopqwop200/GPTQ-for-LLaMa

8-bit quantization has ridiculous PPL and outputs nonsense

Closed · 3 comments

On the triton branch.
4-bit quantization works fine, but --wbits 8 gives a ridiculous PPL (2709 for llama-13b), and the output during inference is complete nonsense.
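
For what it's worth, here is a rough standalone sanity check (my own naive per-channel 8-bit round-to-nearest in plain torch, not GPTQ and not this repo's code): at 8 bits the reconstruction error of a random weight matrix is tiny, which makes me suspect the problem is in how the 8-bit weights are packed or dequantized, rather than 8-bit precision itself.

```python
# Rough sanity check (not GPTQ): naive per-channel 8-bit round-to-nearest
# quantization of a random weight matrix. The point is only that 8-bit
# precision by itself should not blow up the error the way a 2709 PPL suggests.
import torch

def quantize_rtn(w: torch.Tensor, bits: int = 8) -> torch.Tensor:
    # Asymmetric quantization with one scale/zero-point per output row.
    maxq = 2 ** bits - 1
    wmin = w.min(dim=1, keepdim=True).values
    wmax = w.max(dim=1, keepdim=True).values
    scale = (wmax - wmin).clamp(min=1e-8) / maxq
    zero = torch.round(-wmin / scale)
    q = torch.clamp(torch.round(w / scale) + zero, 0, maxq)
    return (q - zero) * scale  # dequantized weights

w = torch.randn(4096, 4096)
err = (w - quantize_rtn(w, 8)).abs().mean().item()
print(f"mean abs reconstruction error at 8 bits: {err:.6f}")
```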

My command for quantizing:
python llama.py ../../models/llama-13b c4 --wbits 8 --true-sequential --act-order --new-eval --save_safetensors ../../models/llama-13b-8bit.safetensors
I tried both with and without --groupsize 128, and the results are nonsense in both cases (the grouped command I ran is shown below).
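
For completeness, the grouped run was identical apart from the extra flag (the -128g output name is just what I used):
python llama.py ../../models/llama-13b c4 --wbits 8 --true-sequential --act-order --new-eval --groupsize 128 --save_safetensors ../../models/llama-13b-8bit-128g.safetensors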

@qwopqwop200, does this mean the repo is no longer just about "4 bits quantization of LLaMA using GPTQ", as the README suggests? :)