Unify the API routes for different inference engines
kerthcet opened this issue · 3 comments
What would you like to be added:
I haven't checked yet, but some projects such as llama.cpp may expose API routes that differ from OpenAI's. Let's unify them for a better user experience.
Could we achieve this with a proxy running as a sidecar container?
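If a proxy did turn out to be necessary, its core would be a route table that maps the unified OpenAI-style paths onto each engine's native paths. Below is a minimal Python sketch under stated assumptions: the helper name `to_upstream_url` is hypothetical, and the native paths on the right-hand side of the table are placeholders for illustration, not a verified list for any engine.

```python
from urllib.parse import urlsplit, urlunsplit

# Hypothetical mapping from OpenAI-style routes to an engine's native routes.
# The right-hand paths are illustrative placeholders, not a verified list.
ROUTE_TABLE = {
    "/v1/completions": "/completion",
    "/v1/chat/completions": "/chat",
}


def to_upstream_url(request_path, upstream_base, table=ROUTE_TABLE):
    """Rewrite an incoming OpenAI-style path into a URL on the upstream engine.

    Paths missing from the table are passed through unchanged, and the query
    string is preserved. Only the scheme and host of upstream_base are used.
    """
    incoming = urlsplit(request_path)
    native_path = table.get(incoming.path, incoming.path)
    base = urlsplit(upstream_base)
    return urlunsplit((base.scheme, base.netloc, native_path, incoming.query, ""))
```

A sidecar would call something like `to_upstream_url(path, "http://localhost:8080")` for each incoming request and forward the body untouched; request and response body translation, if required, would be a separate concern.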
Why is this needed:
Completion requirements:
This enhancement requires the following artifacts:
- Design doc
- API change
- Docs update
The artifacts should be linked in subsequent comments.
/assign
I think this was my mistake, because llama.cpp also supports OpenAI-style routes, for example:
curl --request POST \
--url http://localhost:8080/v1/completions \
--header "Content-Type: application/json" \
--data '{"prompt": "Building a website can be done in 10 simple steps:","n_predict": 128}'
{"content":" You choose a domain name, you choose your domain registrar, you buy domain name and domain registrar software, you register your domain with your registrar and register your email, you create a website template, you design your website, you upload your website files, you deploy your website, and finally, you launch your website. Of course, there are other steps in the process that can be done, such as: getting your domain, hosting your domain, making your website visible, and many more. But all of these steps can be done without a domain name! You can build a website on your own using just 10 simple steps, and","id_slot":0,"stop":true,"model":"/workspace/models/qwen2-0_5b-instruct-q5_k_m.gguf","tokens_predicted":128,"tokens_evaluated":13,"generation_settings":{"n_ctx":32768,"n_predict":-1,"model":"/workspace/models/qwen2-0_5b-instruct-q5_k_m.gguf","seed":4294967295,"temperature":0.800000011920929,"dynatemp_range":0.0,"dynatemp_exponent":1.0,"top_k":40,"top_p":0.949999988079071,"min_p":0.05000000074505806,"tfs_z":1.0,"typical_p":1.0,"repeat_last_n":64,"repeat_penalty":1.0,"presence_penalty":0.0,"frequency_penalty":0.0,"penalty_prompt_tokens":[],"use_penalty_prompt_tokens":false,"mirostat":0,"mirostat_tau":5.0,"mirostat_eta":0.10000000149011612,"penalize_nl":false,"stop":[],"max_tokens":128,"n_keep":0,"n_discard":0,"ignore_eos":false,"stream":false,"logit_bias":[],"n_probs":0,"min_keep":0,"grammar":"","samplers":["top_k","tfs_z","typical_p","top_p","min_p","temperature"]},"prompt":"Building a website can be done in 10 simple steps:","truncated":false,"stopped_eos":false,"stopped_word":false,"stopped_limit":true,"stopping_word":"","tokens_cached":140,"timings":{"prompt_n":13,"prompt_ms":288.028,"prompt_per_token_ms":22.156000000000002,"prompt_per_second":45.13450081242101,"predicted_n":128,"predicted_ms":7323.293,"predicted_per_token_ms":57.2132265625,"predicted_per_second":17.478475871441987},"index":0}
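For reference, the same request can be built from Python using only the standard library. This is a sketch that assumes the server address and JSON fields from the curl command above; the helper names `build_completion_request`, `extract_text`, and `complete` are illustrative, and `extract_text` simply reads the `content` field seen in the response above.

```python
import json
import urllib.request


def build_completion_request(prompt, n_predict=128, base_url="http://localhost:8080"):
    """Build (but do not send) the POST request shown in the curl example."""
    payload = {"prompt": prompt, "n_predict": n_predict}
    return urllib.request.Request(
        url=f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_text(response_body):
    """Return the generated text from a response shaped like the one above."""
    return json.loads(response_body)["content"]


def complete(prompt, **kwargs):
    """Send the request and return the generated text (needs a running server)."""
    request = build_completion_request(prompt, **kwargs)
    with urllib.request.urlopen(request) as response:
        return extract_text(response.read().decode("utf-8"))
```

With a server listening on port 8080, `complete("Building a website can be done in 10 simple steps:")` would return the generated continuation as a string.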
Let's close this for now. Reopen if necessary.
/close
From https://github.com/ggerganov/llama.cpp#web-server
llama.cpp web server is a lightweight OpenAI API compatible HTTP server that can be used to serve local models and easily connect them to existing clients.
So OpenAI API compatibility is the project's official promise.