Choosing the model via the POST request when making an API call
Currently, providing a model is a required argument:
python vision.py
usage: vision.py [-h] -m MODEL [-b BACKEND] [-f FORMAT] [-d DEVICE] [--device-map DEVICE_MAP]
[--max-memory MAX_MEMORY] [--no-trust-remote-code] [-4] [-8] [-F] [-T MAX_TILES]
[-L {DEBUG,INFO,WARNING,ERROR,CRITICAL}] [-P PORT] [-H HOST] [--preload]
vision.py: error: the following arguments are required: -m/--model
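For context, the server currently loads a single model chosen at startup, for example (the model id and extra option here are only illustrative):

python vision.py -m vikhyatk/moondream2 --load-in-4bit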
Aim: add the ability to choose the model when calling the API. This would be a great option and would give additional flexibility.
If I understand you correctly, you would want the model loaded as specified by the client side?
So, something like:
response = client.chat.completions.create(model="OpenGVLab/InternVL2-Llama3-76B", messages=messages, **params)
This is a bit complex because you can't specify any options like --load-in-4bit, flash-attn, etc. It would probably need a model-specific default config which would also be loaded with the request. I'm working on a system for this with the openedai-image server, but am not really happy yet with how complex it is.
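(Just to illustrate that idea, not an existing feature: a per-model defaults map on the server side might look something like this; the second model id and the option names are placeholders.)

# hypothetical per-model default options, applied when a client requests that model
MODEL_DEFAULTS = {
    "OpenGVLab/InternVL2-Llama3-76B": {"load_in_4bit": True, "use_flash_attn": True},
    "some-org/another-model": {"load_in_4bit": False, "use_flash_attn": False},
}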
Yes, correct.
To keep it simple, we could start with some default values such as --load-in-4bit, flash-attn, etc. for all models.
Based on the request it receives, the server would download the model and get it ready to be served (which means the first API call will take some time to return a response).
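A rough sketch of that lazy-loading idea (purely illustrative; `load_fn` stands in for whatever loader the server uses internally):

from typing import Any, Callable, Dict

_loaded: Dict[str, Any] = {}

def get_backend(model_id: str, load_fn: Callable[[str], Any]) -> Any:
    # the first request for a model pays the download/load cost;
    # later requests reuse the cached instance
    if model_id not in _loaded:
        _loaded[model_id] = load_fn(model_id)
    return _loaded[model_id]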
Just FYI, I think the openai client times out after about 30 or 60 seconds, so it's likely this will not work well unless the model is very small.
What about a web UI instead? I just don't think the API is well suited for model management but I do admit it's a nice feature.
Sure, I get you. That's fine.
Just another thought: for the first request it could simply respond with "Model is being downloaded, please try again in a few minutes" (considering the openai client timeout is about 60 seconds).
If it's not well suited and too complicated, then we don't need to do this. These are just ideas.
Yes, a web UI is fine instead. Thanks :)
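That "still downloading" reply could be a simple error payload returned until the model is ready; the field names below are assumptions, loosely following the shape of OpenAI-style error bodies:

import json

def model_loading_response(model_id: str) -> str:
    # returned while the requested model is still downloading or loading
    return json.dumps({
        "error": {
            "message": f"Model {model_id} is being downloaded, please try again in a few minutes.",
            "type": "model_loading",
        }
    })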
Here is how it is done by llama.cpp:
https://github.com/Jaimboh/Llama.cpp-Local-OpenAI-server/blob/main/README.md
Multiple Model Load with Config
python -m llama_cpp.server --config_file config.json
cat config.json
{
"host": "0.0.0.0",
"port": 8000,
"models": [
{
"model": "models/mistral-7b-instruct-v0.1.Q4_0.gguf",
"model_alias": "mistral",
"chat_format": "chatml",
"n_gpu_layers": -1,
"offload_kqv": true,
"n_threads": 12,
"n_batch": 512,
"n_ctx": 2048
},
{
"model": "models/mixtral-8x7b-instruct-v0.1.Q2_K.gguf",
"model_alias": "mixtral",
"chat_format": "chatml",
"n_gpu_layers": -1,
"offload_kqv": true,
"n_threads": 12,
"n_batch": 512,
"n_ctx": 2048
},
{
"model": "models/mistral-7b-instruct-v0.1.Q4_0.gguf",
"model_alias": "mistral-function-calling",
"chat_format": "functionary",
"n_gpu_layers": -1,
"offload_kqv": true,
"n_threads": 12,
"n_batch": 512,
"n_ctx": 2048
}
]
}
You can preload an array of models as specified in config.json, and it is smart enough to swap to the right model as specified in the client request.
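With a setup like that, the client just picks one of the aliases from config.json per request, for example:

from openai import OpenAI

# host/port taken from the config.json above; a local llama_cpp.server
# typically does not check the API key, so any placeholder works
client = OpenAI(base_url="http://localhost:8000/v1", api_key="sk-none")

response = client.chat.completions.create(
    model="mixtral",  # selects the preloaded model with this alias
    messages=[{"role": "user", "content": "Hello"}],
)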
Here is a blog with a simple Streamlit GUI interface too:
https://medium.com/@odhitom09/running-openais-server-locally-with-llama-cpp-5f29e0d955b7
I would love some kind of GUI when interacting with these multimodal models, especially initially, before I know how to automate it.
For manually interacting with the models, I can highly recommend either open-webui (via docker, which also works with openedai-speech, whisper, images, etc. - I use this) or web.chatbox.app (can be used fully in the browser, without any installation). You can configure an openai API provider (with the API BASE URL) for the gpt-4-vision-preview model in either of those and use the GUI there to upload and chat with images in context.
For testing, I prefer the 'raw' text output from the included console app chat_with_image.py.
@matatonic
I am very interested in the ability of openedai-vision to load multiple models simultaneously. For example, florence2 is very good at bounding boxes, moondream2 is very fast with a summary inference, and the minicpm2.5 model is the newest kid on the block: it takes more firepower but performs better and can compare two input images, etc. The point is that different models excel at different things, and the pipeline may need to invoke all three of them efficiently. It would be excellent if we could support all three models, preload them, and have the right model invoked by the POST request when the API call is made. How feasible is this to do?
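A rough sketch of that kind of routing on the server side (purely illustrative, not the actual openedai-vision code; the lambdas are placeholders standing in for the loaded vision models):

from typing import Callable, Dict, List

# hypothetical registry of preloaded backends, keyed by the `model` field of the request
PRELOADED: Dict[str, Callable[[List[dict]], str]] = {
    "florence2": lambda messages: "bounding boxes ...",
    "moondream2": lambda messages: "fast summary ...",
    "minicpm2.5": lambda messages: "detailed comparison ...",
}

def handle_chat_completion(model: str, messages: List[dict]) -> str:
    # dispatch the request to whichever preloaded model the client asked for
    backend = PRELOADED.get(model)
    if backend is None:
        raise ValueError(f"model not preloaded: {model}")
    return backend(messages)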
Thanks again for providing the newest models so swiftly.
It's doable, and I will probably do this along with model switching/selecting via API in an upcoming release.
It's a more significant change, and I'll need to update my testing also so it might take a bit longer.
PS. I'm currently out of the country and have limited access to the internet.