mudler/LocalAI

feat: add OpenVINO Model Server as a backend

fakezeta opened this issue

Is your feature request related to a problem? Please describe.
From my benchmarks, OpenVINO performance on the iGPU is roughly 5 to 8 times faster than the llama.cpp SYCL implementation for Mistral-based 7B models.

With SYCL on the iGPU (UHD 770) I can serve Starling and OpenChat at 2 to 4 tokens/s, while with OpenVINO and INT8 quantization I can easily reach 15-16 tokens/s.
I don't know what the performance is on Arc or the NPU, since I don't have the hardware to test.

This could be an effective solution for computers with an iGPU.

I've uploaded an OpenVINO version of openchat-3.5-0106 to HF for testing: https://huggingface.co/fakezeta/openchat-3.5-0106-openvino-int8/

It would be compatible with the torch, ONNX, and OpenVINO model formats.
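For testing, a minimal sketch along these lines (using the Optimum-Intel `OVModelForCausalLM` API; the prompt and generation parameters are just placeholders) should load the uploaded model on the iGPU:

```python
# pip install optimum[openvino]
from optimum.intel import OVModelForCausalLM
from transformers import AutoTokenizer

model_id = "fakezeta/openchat-3.5-0106-openvino-int8"

# Loads the OpenVINO IR directly; for a plain torch checkpoint,
# passing export=True would convert it on the fly.
model = OVModelForCausalLM.from_pretrained(model_id)
model.to("GPU")  # target the Intel iGPU; "CPU" or "AUTO" also work
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Placeholder prompt in the OpenChat chat format
prompt = "GPT4 Correct User: Hello<|end_of_turn|>GPT4 Correct Assistant:"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```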

Describe the solution you'd like

This could be implemented with the Optimum-Intel library or with the gRPC-based OpenVINO Model Server.
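The Optimum-Intel path would look much like the test snippet above. For the OpenVINO Model Server route, LocalAI would talk to a running OVMS instance over gRPC; a very rough sketch with the `ovmsclient` helper is below. The model name and tensor names are assumptions that depend on how the IR is exported and configured, and a real backend would still have to drive the generation loop (sampling, KV cache or a stateful model) itself:

```python
# pip install ovmsclient
import numpy as np
from ovmsclient import make_grpc_client

# Assumes an OVMS instance serving the exported IR under the name "openchat"
client = make_grpc_client("localhost:9000")

# Single forward pass: token ids in, next-token logits out.
# Input/output tensor names depend on the exported model.
input_ids = np.array([[1, 22557]], dtype=np.int64)  # placeholder token ids
attention_mask = np.ones_like(input_ids)

logits = client.predict(
    inputs={"input_ids": input_ids, "attention_mask": attention_mask},
    model_name="openchat",
)
# If the model exposes several outputs, predict() returns a dict instead
print(np.argmax(logits[0, -1]))  # greedy pick of the next token
```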