Huge performance difference between "Transformer-like" usage and "llama.cpp-like" usage
Ankur-singh opened this issue · 2 comments
Llama.cpp-like usage (running the scripts directly) is really fast, but when I use the ITREX library (which goes through the Transformer-like usage) the performance difference is huge. Here is the time taken by each approach:
- Transformer-like usage: >10 mins
- Llama.cpp-like usage: ~2 mins
Here's how I am using them:
Transformer-like usage:
model = AutoModelForCausalLM.from_pretrained(model_name, load_in_4bit=True)
outputs = model.generate(inputs, streamer=streamer, max_new_tokens=300)
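For reference, the rest of my setup looks roughly like this (the checkpoint name is illustrative; streamer and inputs come from the standard transformers tokenizer and TextStreamer):

from transformers import AutoTokenizer, TextStreamer
from intel_extension_for_transformers.transformers import AutoModelForCausalLM

model_name = "mistralai/Mistral-7B-v0.1"  # illustrative Mistral checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
streamer = TextStreamer(tokenizer)
inputs = tokenizer("She opened the door and see", return_tensors="pt").input_ids

# load_in_4bit=True routes through ITREX's 4-bit weight-only quantization
model = AutoModelForCausalLM.from_pretrained(model_name, load_in_4bit=True)
outputs = model.generate(inputs, streamer=streamer, max_new_tokens=300)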
Llama.cpp-like usage:
python neural-speed/scripts/inference.py --model_name mistral -m runtime_outs/ne_mistral_q_nf4_bestla_cfp32_g32.bin -c 512 -b 1024 -n 256 --color -p "She opened the door and see"
Is there something that I am missing?
@Ankur-singh Hi, please add numactl if you use the transformer-like APIs. It helps you get the expected performance.
For example:
numactl -m 0 -C 0-55 python python_api_example_for_gguf.py
Actually, you can use numactl for both usages:
numactl -m 0 -C 0-55 python neural-speed/scripts/inference.py --model_name mistral -m runtime_outs/ne_mistral_q_nf4_bestla_cfp32_g32.bin -c 512 -b 1024 -n 256 --color -p "She opened the door and see"
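For context, -m 0 binds memory allocation to NUMA node 0 and -C 0-55 pins the process to cores 0-55; adjust these ranges to your machine's topology, which you can inspect with standard tools, e.g.:

numactl --hardware
lscpu | grep -i NUMA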
Thanks @Zhenzhong1, worked like a charm.