8-bit quantization has ridiculous PPL and output nonsense

Question

8-bit quantization has ridiculous PPL and output nonsense

Closed this issue 2 years ago · 3 comments

On triton branch.
4 bit quantization works fine, but --wbits 8 has ridiculous PPL (2709 for llama-13b), and the output during inference is completely nonsense.

qwopqwop200 commented 2 years ago

fix bug

Answer 1 · 2023-04-12T16:48:16.000Z

My command for quantizing:
python llama.py ../../models/llama-13b c4 --wbits 8 --true-sequential --act-order --new-eval --save_safetensors ../../models/llama-13b-8bit.safetensors
I tried with or without --groupsize 128 and the results are both nonsense.

Answer 2 · 2023-04-15T16:16:48.000Z

@qwopqwop200, does it mean is no longer just about "4 bits quantization of LLaMA using GPTQ" as the README suggests? :)