This is a simple example of a self-hosted LLM API, using Llama 2 7B-chat. More to follow.
You need at least one model (e.g. Llama 2 7B-chat). You can download one with:
wget https://huggingface.co/localmodels/Llama-2-7B-Chat-ggml/resolve/main/llama-2-7b-chat.ggmlv3.q2_K.bin
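The `docker run` command below bind-mounts `$(pwd)/models` into the container, so the downloaded model file should end up in a `models/` directory on the host. A minimal setup:

```shell
# The run command bind-mounts "$(pwd)"/models into the container at /models,
# so place the downloaded model file under models/ on the host.
mkdir -p models
# Move the model in if it was downloaded to the current directory.
[ -f llama-2-7b-chat.ggmlv3.q2_K.bin ] && mv llama-2-7b-chat.ggmlv3.q2_K.bin models/ || true
ls models
```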
Instead of bundling the model(s) into the Docker image with the API, mount them in at runtime (via Docker or Kubernetes). This lets you update the model(s) without rebuilding the image.
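On Kubernetes, the equivalent of the bind mount is a volume mount. The sketch below uses a hostPath volume and hypothetical names (`pyllama`, `/opt/models`) purely for illustration; a PersistentVolumeClaim would be more typical in production:

```yaml
# Illustrative only: mounts a host directory of models into the API container.
apiVersion: v1
kind: Pod
metadata:
  name: pyllama               # hypothetical name
spec:
  containers:
    - name: api
      image: smigula/pyllama:0.1.1
      ports:
        - containerPort: 8501
      volumeMounts:
        - name: models
          mountPath: /models  # same target path as the docker run command below
  volumes:
    - name: models
      hostPath:
        path: /opt/models     # hypothetical host directory holding the .bin files
```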
I have included a multi-stage Docker build for convenience. Build and run the image with:
TAG=0.1.1
docker build -t smigula/pyllama:$TAG .
docker run --mount type=bind,source="$(pwd)"/models,target=/models -p 8501:8501 smigula/pyllama:$TAG
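For reference, a multi-stage build along these lines might look as follows. This is an illustrative sketch, not the repository's actual Dockerfile; the base images, dependency steps, and entrypoint are assumptions:

```dockerfile
# Illustrative multi-stage build: dependencies are installed in a full image,
# then only the installed packages and app code are copied into a slim runtime.
FROM python:3.11 AS builder
WORKDIR /app
COPY requirements.txt .
RUN pip install --prefix=/install -r requirements.txt

FROM python:3.11-slim
WORKDIR /app
COPY --from=builder /install /usr/local
COPY . .
EXPOSE 8501
# Models are NOT baked into the image; they are mounted at /models at runtime.
CMD ["python", "main.py"]   # hypothetical entrypoint
```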