I am not sure why something like this doesn't already exist: a bare-bones way to serve a dev model behind the OpenAI API endpoints/schema. It is not intended to be performant or scalable; the goal is getting results from benchmarks like AgentBench with minimal effort while staying easy to adapt per model, since the models I am interested in generally do not ship a chat template, and encoding without images can cause issues.
The simplest way is to `git clone` the repo and then run `pdm install` (or `pdm sync`); otherwise `pip install git+https://github.com/grahamannett/troapis` should work.
There are a few ways to use it (as I am not yet clear how I want to be using it):

One way is to write a `model_entrypoint.py` file that loads the model; the server will check for that file and try to load it. This makes it easy to run `fastapi dev src/troapis/app.py` to start the server and test everything interactively.
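A minimal sketch of what that file could look like, assuming a Hugging Face-style model (the `adept/fuyu-8b` choice and the loading code are just illustrative, and exactly which module-level names the loader picks up isn't spelled out here, so adjust as needed):

```python
# model_entrypoint.py -- illustrative sketch, not the canonical entrypoint
from transformers import FuyuForCausalLM, FuyuProcessor

model_name = "adept/fuyu-8b"  # example model, swap in whatever you are benchmarking

# load the model and processor once at import time so the server can reuse them
model = FuyuForCausalLM.from_pretrained(model_name, device_map="auto")  # device_map needs accelerate
processor = FuyuProcessor.from_pretrained(model_name)

# keys follow the model_info spec described below; anything omitted falls back
# to the documented defaults
model_info = {
    "model_name": model_name,
    "model": model,
    "processor": processor,
}
```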
Another way is to run the app yourself and pass the model in (as seen in the `if __name__ == "__main__":` part of `model_entrypoint.py`):
```python
from troapis.app import run_app

model_info = load_model(...)  # load the model and processor/tokenizer and set up anything else (e.g. chat template or encoding)
run_app(model_info_from=model_info)
```
For development you can just run `pdm run dev`.
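Once the server is running, anything that speaks the OpenAI chat-completions schema can point at it. A rough illustration using the `openai` Python client (the base URL, port, and `/v1` prefix are assumptions here; match whatever the app actually binds to and exposes):

```python
from openai import OpenAI

# the local server does not check the key, but the client requires a value
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

resp = client.chat.completions.create(
    model="adept/fuyu-8b",  # whatever model_name the server was loaded with
    max_tokens=50,
    messages=[{"role": "user", "content": "Hello!"}],
)
print(resp.choices[0].message.content)
```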
The `model_info` object/dict needs the following to work (a sketch of a fully populated one follows this list):

- `model_name` - the model name
- `model` - the model object
- `processor` or `tokenizer` - the processor/tokenizer object; if not provided, the default one from `model_name` will be used
- `decode` - the decoding function; if not provided, will try to use `processor.decode` or `tokenizer.decode`
- `decode_kwargs` - kwargs for the decoding function; if not provided, defaults to `{"skip_special_tokens": True}`
- `encode` - the encoding function; if not provided, will try to use `processor.__call__` or `tokenizer.__call__`
- `encode_kwargs` - kwargs for the encoding function; if not provided, defaults to `{"return_tensors": "pt"}`
- `generate` - the generation function; if not provided, will try to use `model.generate`
- `generate_kwargs` - kwargs for the generation function; if not provided, defaults to empty
- `apply_chat_template` - the chat template function; if not provided, will use `processor.tokenizer.apply_chat_template` or `tokenizer.apply_chat_template`. This is actually really important for AgentBench generations to work at all; falling back to a default template for a model that does not have one will likely make it fail almost every task
- `apply_chat_template_kwargs` - kwargs for the chat template function; if not provided, defaults to `{"tokenize": False, "add_generation_prompt": True}`
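As a concrete but purely illustrative example, a fully specified `model_info` might look something like the following for a Hugging Face-style model (in practice most of these keys can be omitted in favor of the defaults above; the loading code is an assumption):

```python
from transformers import FuyuForCausalLM, FuyuProcessor

model_name = "adept/fuyu-8b"
model = FuyuForCausalLM.from_pretrained(model_name, device_map="auto")
processor = FuyuProcessor.from_pretrained(model_name)

model_info = {
    "model_name": model_name,
    "model": model,
    "processor": processor,
    # everything below just restates the documented defaults explicitly
    "decode": processor.decode,
    "decode_kwargs": {"skip_special_tokens": True},
    "encode": processor.__call__,
    "encode_kwargs": {"return_tensors": "pt"},
    "generate": model.generate,
    "generate_kwargs": {},  # empty by default; put e.g. max_new_tokens here
    "apply_chat_template": processor.tokenizer.apply_chat_template,
    "apply_chat_template_kwargs": {"tokenize": False, "add_generation_prompt": True},
}
```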
Similar projects I looked at, and why they did not quite fit:

- https://github.com/jquesnelle/transformers-openai-api
  - doesn't have a `chat/completions` route
- https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/openai/api_server.py
  - doesn't support most of the models I need; adding a new model requires https://docs.vllm.ai/en/latest/models/adding_model.html, which means forking the repo and patching a lot of stuff, and it is not easy to inspect the model during inference to tell what is happening, which is helpful for multimodal models
- https://github.com/lhenault/simpleAI/blob/main/src/simple_ai/api_models.py
  - actually might be pretty similar to what I need, but I found it too late and adding another model is more burdensome than I would like
- https://github.com/lm-sys/FastChat/blob/main/playground/FastChat_API_GoogleColab.ipynb
  - requires spinning up 3 services to work, and the conversation templates make adding another model not super easy: https://github.com/lm-sys/FastChat/blob/main/fastchat/conversation.py
  - the drawback is you have to run 3 services and wait for them to start up in a particular order:
    - controller: `python -m fastchat.serve.controller --host=0.0.0.0`
    - model worker: `python -m fastchat.serve.model_worker --model-path "adept/fuyu-8b" --model-names "adept/fuyu-8b" --host=0.0.0.0`
      - can use `--load-8bit`, but it doesn't seem to improve speed or memory usage?
    - OpenAI API server: `python -m fastchat.serve.openai_api_server --host=0.0.0.0 --port=11434`
Example request once all three services are up:

```sh
MODEL_NAME=...
time curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "'"$MODEL_NAME"'", "max_tokens": 50, "messages": [{"role": "user", "content": "Hello!"}]}'
```

```
{"id":"chatcmpl-L6sdTmvpy3zqE8hVUG2uKM","object":"chat.completion","created":1718830977,"model":"...","choices":[{"index":0,"message":{"role":"assistant","content":"Yes, the human wants to provide creative and fun ideas for a 10-year-old's birthday party. What do you think would be the best idea for a 10-year-old?\n"},"finish_reason":"stop"}],"usage":{"prompt_tokens":433,"total_tokens":466,"completion_tokens":33}}

real    0m1.327s
user    0m0.004s
sys     0m0.009s
```
Still to do:

- allow serving multiple instances of a model for concurrent requests
  - likely need to use something similar to https://docs.ray.io/en/latest/serve/model-multiplexing.html
  - the alternative is to just use `multiprocessing.Queue` so that one model can be loaded on each GPU and served from whichever is available (rough sketch below); I have a feeling this will be more complicated than expected
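A very rough sketch of what the `multiprocessing.Queue` approach could look like (everything here is hypothetical and not part of the repo; it just shows the one-model-per-GPU, shared-request-queue idea):

```python
# hypothetical sketch: one worker process per GPU, all pulling from a shared queue
import multiprocessing as mp
import os


def worker(gpu_id, requests, responses):
    # pin this process to a single GPU before any model library initializes CUDA
    os.environ["CUDA_VISIBLE_DEVICES"] = str(gpu_id)

    # model_info = load_model(...)  # placeholder: load the model once per process

    while True:
        item = requests.get()
        if item is None:  # sentinel -> shut down
            break
        request_id, prompt = item
        # real generation would go here, using the model loaded above
        responses.put((request_id, f"[gpu {gpu_id}] echoed: {prompt}"))


def main(num_gpus=2):
    requests, responses = mp.Queue(), mp.Queue()
    procs = [
        mp.Process(target=worker, args=(gpu_id, requests, responses), daemon=True)
        for gpu_id in range(num_gpus)
    ]
    for p in procs:
        p.start()

    # the request handler would enqueue work and wait for the matching request_id;
    # here we just round-trip a few prompts to show the flow
    prompts = ["Hello!", "How are you?", "Bye"]
    for i, prompt in enumerate(prompts):
        requests.put((i, prompt))
    for _ in prompts:
        print(responses.get())

    for _ in procs:  # one sentinel per worker
        requests.put(None)
    for p in procs:
        p.join()


if __name__ == "__main__":
    main()
```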