intel/neural-speed

Low text quality when running inference on a GGUF model with Neural Speed vs. llama.cpp


Current Behavior:

  • Generated a GGUF model from llama2-7b using the llama.cpp code; running inference on it directly with Neural Speed gives the following text, which, as you can see, is low quality:

[screenshot: Neural Speed output]

  • Running inference on the exact same GGUF file with llama.cpp gives me this text:

[screenshot: llama.cpp output]

Steps To Reproduce:

# Windows
## llama.cpp
main.exe -m ggml-model-q4_0.gguf -p "Building a website can be done in 10 simple steps:" -n 512 --no-mmap --ignore-eos
## Neural Speed
python scripts\inference.py --model_name llama2 -m ggml-model-q4_0.gguf -n 512 -p "Building a website can be done in 10 simple steps:"

Environment:

  • OS: Win11
  • HW: SPR w9-3595X E5 128GB

Possibly related, since the setup is similar: Phi-2 converted to f32 seems to work fine, but int4 produces garbage output with compute_dtype int8 and group size -1 (Ubuntu 22.04, Raptor Lake notebook).
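For context, a rough sketch of the kind of weight-only quantization setup being described above. The class and argument names follow the intel-extension-for-transformers style API and are assumptions on my part, not the exact code used in that report:

```python
from transformers import AutoTokenizer
from intel_extension_for_transformers.transformers import AutoModelForCausalLM, WeightOnlyQuantConfig

# Assumed config: int4 weights, int8 compute dtype, per-channel quantization (group_size=-1),
# matching the parameters mentioned above. Names are illustrative, not verified against the actual run.
woq_config = WeightOnlyQuantConfig(weight_dtype="int4", compute_dtype="int8", group_size=-1)

model_name = "microsoft/phi-2"  # the Phi-2 checkpoint referred to above (placeholder)
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=woq_config,
                                             trust_remote_code=True)

inputs = tokenizer("Building a website can be done in 10 simple steps:", return_tensors="pt").input_ids
outputs = model.generate(inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0]))
```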

I don't think int4 with the params you mentioned is the issue, because if I don't use GGUF models it works just fine, with good quality text.

I don't think the quantization is the problem either, because in my case it's GGUF as well (converted and quantized).


I think you should open a new issue for the Phi-2 case, because when I use the quantization param group size -1 it also causes bad output.

@aahouzi Hi, please try the transformer-based APIs by using the script scripts/python_api_example_for_gguf.py.

Example in this PR: #48.
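For anyone landing here, roughly what that flow looks like. This is a minimal sketch only; the class/method names, the tokenizer repo, and the file path below are assumptions, so treat scripts/python_api_example_for_gguf.py and PR #48 as the authoritative reference:

```python
from transformers import AutoTokenizer, TextStreamer
from neural_speed import Model

# Assumptions: the Hugging Face repo used for the tokenizer and the local GGUF path are placeholders.
model_name = "meta-llama/Llama-2-7b-hf"
gguf_file = "ggml-model-q4_0.gguf"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
streamer = TextStreamer(tokenizer)

# Load the already-quantized GGUF file directly instead of converting/quantizing again.
model = Model()
model.init_from_bin(model_name, gguf_file)

prompt = "Building a website can be done in 10 simple steps:"
inputs = tokenizer(prompt, return_tensors="pt").input_ids
outputs = model.generate(inputs, streamer=streamer, max_new_tokens=512)
```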

I have double-checked it. The low-quality text issue no longer occurs if you use this script.