
LLM Inference Server

A simple Python server for hosting Large Language Models behind a RESTful API.

Features:

  • Configurable prompt templates
  • OpenAI-compatible REST API (see the example request below)
  • Config APIs to load and unload models
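
Once a model is loaded (see Config APIs below), it should answer a standard chat-completions request. The sketch below assumes the server exposes the usual /v1/chat/completions route and the OpenAI payload shape; the exact path and required fields may differ.

curl --request POST \
  --url http://localhost:8000/v1/chat/completions \
  --header 'content-type: application/json' \
  --data '{
  "messages": [
    {"role": "user", "content": "Hello!"}
  ]
}'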

Setup

make setup

Start server

make start

Config APIs

1. Load model

curl --request POST \
  --url http://localhost:8000/config/v1/load-model \
  --header 'content-type: application/json' \
  --data '{
  "path": "./mistral-7b-openorca.Q5_K_M.gguf",
  "options": {
    "prompt_tmpl": "chatml"
  }
}'
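
The prompt_tmpl option selects how chat messages are wrapped before they reach the model. For reference, chatml (the format the OpenOrca fine-tunes expect) frames each message roughly like this:

<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
Hello!<|im_end|>
<|im_start|>assistant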

> Make sure you have downloaded a quantized GGUFv2 model first. Example: https://huggingface.co/TheBloke/Mistral-7B-OpenOrca-GGUF
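
One way to fetch the exact file referenced above, using Hugging Face's standard resolve/main download URL (adjust the file name if you pick a different quantization):

curl -L -o mistral-7b-openorca.Q5_K_M.gguf \
  https://huggingface.co/TheBloke/Mistral-7B-OpenOrca-GGUF/resolve/main/mistral-7b-openorca.Q5_K_M.gguf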

2. Unload model

curl --request POST \
  --url http://localhost:8000/config/v1/unload-model

OpenAPI spec