Low text quality when running inference on a GGUF model with Neural Speed vs. llama.cpp
Current Behavior:
- Generated a GGUF model from llama2-7b using the llama.cpp code. Running inference on it directly with Neural Speed gives the following output, which as you can see is low quality:
- Running inference on the exact same GGUF file with llama.cpp gives me this output:
Steps To Reproduce:
```
# Windows
## llama.cpp
main.exe -m ggml-model-q4_0.gguf -p "Building a website can be done in 10 simple steps:" -n 512 --no-mmap --ignore-eos
## Neural Speed
python scripts\inference.py --model_name llama2 -m ggml-model-q4_0.gguf -n 512 -p "Building a website can be done in 10 simple steps:"
```
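The GGUF file itself came from the llama.cpp conversion flow; a sketch of that step is below. The paths and output file names are illustrative, and the exact script/binary names depend on the llama.cpp revision used:

```
## create the q4_0 GGUF with llama.cpp (illustrative paths)
python convert.py models\llama-2-7b --outtype f16 --outfile ggml-model-f16.gguf
quantize.exe ggml-model-f16.gguf ggml-model-q4_0.gguf q4_0
```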
Environment:
- OS: Win11
- HW: SPR w9-3595X E5 128GB
Possibly related, since the setup is similar: Phi-2 converted to f32 seems to work fine, but int4 produces garbage output, with compute_dtype int8 and group size -1 (Ubuntu 22.04, Raptor Lake notebook).
I don't think int4 with the parameters you mentioned is the issue, because if I don't use GGUF models it works just fine and produces good-quality text.
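To be concrete, by "not using GGUF models" I mean the native Neural Speed flow, roughly like the sketch below; the convert/quantize flag names follow the Neural Speed README and may differ between versions, and the Hugging Face model path is illustrative:

```
## convert the HF checkpoint to Neural Speed's own format, then quantize to int4
python scripts\convert.py --outtype f32 --outfile ne-llama2-f32.bin meta-llama/Llama-2-7b-hf
python scripts\quantize.py --model_name llama2 --model_file ne-llama2-f32.bin --out_file ne-llama2-q4.bin --weight_dtype int4 --compute_dtype int8 --group_size 128
## inference on the quantized .bin (not GGUF) gives good-quality text
python scripts\inference.py --model_name llama2 -m ne-llama2-q4.bin -n 512 -p "Building a website can be done in 10 simple steps:"
```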
I don't think the quantization is the problem either, because in my case it is GGUF as well (converted and quantized).
I think you should open a new issue for that, because when I use the quantization parameter group size -1 I also get bad output.
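For reference, the setting I mean is the group size on the quantize step, something like the line below (same caveat that the exact flag names may vary by version, and the file names are illustrative):

```
python scripts\quantize.py --model_name llama2 --model_file ne-llama2-f32.bin --out_file ne-llama2-q4_g-1.bin --weight_dtype int4 --compute_dtype int8 --group_size -1
```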