Low text quality when running inference on a GGUF model with Neural Speed vs. llama.cpp
Current Behavior:
- Generated a GGUF model from llama2-7b using the llama.cpp code. Running inference on it directly with Neural Speed gives the following output, which as you can see is low quality:
- Running inference on the exact same GGUF file with llama.cpp gives me this output:
Steps To Reproduce:
```
# Windows
## llama.cpp
main.exe -m ggml-model-q4_0.gguf -p "Building a website can be done in 10 simple steps:" -n 512 --no-mmap --ignore-eos
## Neural Speed
python scripts\inference.py --model_name llama2 -m ggml-model-q4_0.gguf -n 512 -p "Building a website can be done in 10 simple steps:"
```
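The GGUF file itself came from the llama.cpp conversion flow; a sketch of that step is below. The paths and output file names are illustrative, and the exact script/binary names depend on the llama.cpp revision used:

```
## create the q4_0 GGUF with llama.cpp (illustrative paths)
python convert.py models\llama-2-7b --outtype f16 --outfile ggml-model-f16.gguf
quantize.exe ggml-model-f16.gguf ggml-model-q4_0.gguf q4_0
```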
Environment:
- OS: Win11
- HW: SPR w9-3595X E5 128GB
Possibly related, since the setup is similar: Phi-2 converted to f32 seems to work fine, but int4 produces garbage output, with compute_dtype int8 and group size -1 (Ubuntu 22.04, Raptor Lake notebook).
I don't think int4 with the parameters you mentioned is the issue, because if I don't use GGUF models it works just fine and produces good-quality text.
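To be concrete, by "not using GGUF models" I mean the native Neural Speed flow, roughly like the sketch below; the convert/quantize flag names follow the Neural Speed README and may differ between versions, and the Hugging Face model path is illustrative:

```
## convert the HF checkpoint to Neural Speed's own format, then quantize to int4
python scripts\convert.py --outtype f32 --outfile ne-llama2-f32.bin meta-llama/Llama-2-7b-hf
python scripts\quantize.py --model_name llama2 --model_file ne-llama2-f32.bin --out_file ne-llama2-q4.bin --weight_dtype int4 --compute_dtype int8 --group_size 128
## inference on the quantized .bin (not GGUF) gives good-quality text
python scripts\inference.py --model_name llama2 -m ne-llama2-q4.bin -n 512 -p "Building a website can be done in 10 simple steps:"
```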
I don't think the quantization is the problem either, because in my case it is GGUF as well (converted and quantized).
I think you should open a new issue for that, because when I use the quantization parameter group size -1 I also get bad output.
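For reference, the setting I mean is the group size on the quantize step, something like the line below (same caveat that the exact flag names may vary by version, and the file names are illustrative):

```
python scripts\quantize.py --model_name llama2 --model_file ne-llama2-f32.bin --out_file ne-llama2-q4_g-1.bin --weight_dtype int4 --compute_dtype int8 --group_size -1
```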