intel/neural-speed

Huge performance difference between "Transformer-like" usage and "llama.cpp-like" usage

Ankur-singh opened this issue · 2 comments

Llama.cpp-like usage (running the bundled scripts) is really fast, but when I use the model through the ITREX library (which relies on the Transformer-like usage), the performance difference is huge. Here is the time taken by each approach:

  • Transformer-like usage: >10 mins
  • Llama.cpp-like usage: ~2 mins

Here's how I am using them:
Transformer-like usage:

from intel_extension_for_transformers.transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(model_name, load_in_4bit=True)
outputs = model.generate(inputs, streamer=streamer, max_new_tokens=300)
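
For reference, a self-contained version of the snippet above might look like the sketch below; the model name and the tokenizer/streamer setup are my assumptions, since they are not shown in the original report:

from transformers import AutoTokenizer, TextStreamer
from intel_extension_for_transformers.transformers import AutoModelForCausalLM

model_name = "mistralai/Mistral-7B-v0.1"  # assumed; substitute the model you actually use
tokenizer = AutoTokenizer.from_pretrained(model_name)
inputs = tokenizer("She opened the door and see", return_tensors="pt").input_ids
streamer = TextStreamer(tokenizer)  # prints decoded tokens as they are generated

model = AutoModelForCausalLM.from_pretrained(model_name, load_in_4bit=True)
outputs = model.generate(inputs, streamer=streamer, max_new_tokens=300)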

Llama.cpp-like usage:

python neural-speed/scripts/inference.py --model_name mistral -m runtime_outs/ne_mistral_q_nf4_bestla_cfp32_g32.bin -c 512 -b 1024 -n 256 --color -p "She opened the door and see"
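
(As I read them, these flags follow llama.cpp's conventions: -c is the context size, -b the batch size, -n the number of tokens to generate, and -p the prompt; check the script's --help output to confirm on your version.)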

Is there something that I am missing?

@Ankur-singh Hi, please add numactl if you use the Transformer-like APIs. It helps you get the expected performance.

For example:
numactl -m 0 -C 0-55 python python_api_example_for_gguf.py
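
Here -m 0 binds memory allocations to NUMA node 0 and -C 0-55 pins the process to cores 0-55; the right values depend on your machine, and you can inspect its topology with standard numactl tooling (nothing neural-speed-specific):

numactl --hardware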

Actually, you can use numactl for both usages:
numactl -m 0 -C 0-55 python neural-speed/scripts/inference.py --model_name mistral -m runtime_outs/ne_mistral_q_nf4_bestla_cfp32_g32.bin -c 512 -b 1024 -n 256 --color -p "She opened the door and see"

Thanks @Zhenzhong1, worked like a charm.