A nice way to fire up exllama-based models behind a small API in Docker.
It pulls the exllama code from GitHub and wraps it in a container. Exllama already has Docker support; this adds a new container that exposes it as a small API.
You can run it as an API in a sub-container, or run it directly in the dev container.
Create `/data/models` on your local system, or edit the mount path in `./.devcontainer/devcontainer.json`. It's towards the bottom, in the `mounts` section:
"mounts": [
// map host ssh to container
"source=${env:HOME}/.ssh,target=/home/vscode/.ssh,type=bind,consistency=cached",
"source=/data,target=/data,type=bind,consistency=consistent"
]
Grab a GPTQ model from Hugging Face, e.g. something like https://huggingface.co/TheBloke/Nous-Hermes-13B-GPTQ works well:
```sh
git lfs install
git clone https://huggingface.co/TheBloke/Nous-Hermes-13B-GPTQ
```
This dev container requires a CUDA-capable GPU.
Edit `./docker/.env` and change `MODEL_PATH` to point at your local model. This is the model the API will load:
```sh
PORT=6010
RUN_UID=1000 # set to 0 to run the service as root inside the container
APPLICATION_STATE_PATH=/data # path to the directory holding application state inside the container
MODEL_PATH=/data/models/TheBloke_Wizard-Vicuna-13B-Uncensored-SuperHOT-8K-GPTQ # replace with the actual model path on the host
```
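For reference, here's a minimal sketch of how a service might pick up these settings at startup. This is illustrative only (plain `os.environ`, with assumed defaults), not the repo's actual startup code:

```python
import os

# Settings supplied via docker/.env. The defaults below are illustrative
# assumptions, not the repo's actual fallbacks.
port = int(os.environ.get("PORT", "6010"))
state_path = os.environ.get("APPLICATION_STATE_PATH", "/data")
model_path = os.environ["MODEL_PATH"]  # required: directory of the GPTQ model to load

print(f"Loading {model_path}, serving on port {port}")
```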
- Pull the repo
- Open in the dev container
- Run `make prepare`
- Run `make run_exllama_service_container`
The API is based on this exllama sample. It's been rejigged into a class and uses waitress. It supports the same endpoints: `infer_precise`, `infer_creative` and `infer_sphinx`.
There is a file at `rest/test_api.rest` that will let you test the API from within VS Code once it's up and running.
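If you'd rather call the service from a script, a sketch along these lines should work. The request shape (a form field named `prompt`, plain-text response) is an assumption based on the upstream exllama Flask sample, so confirm it against `rest/test_api.rest`:

```python
import requests

# Hypothetical client for the running service. The endpoint names come from
# the list above; the 'prompt' form field and text response are assumptions
# borrowed from the upstream exllama Flask sample.
API = "http://localhost:6010"

resp = requests.post(f"{API}/infer_precise", data={"prompt": "Once upon a time,"}, timeout=300)
resp.raise_for_status()
print(resp.text)
```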
The dev container is built with everything you need to run locally.
- Pull the repo
- Open in the dev container
- Run `make prepare`
- Open `langchain/langchain_sample.py` and edit your model path around line 60, e.g. `model_path='/data/models/TheBloke_Nous-Hermes-13B-GPTQ'`
- On the VS Code debug tab, select `Langchain Sample` and press F5
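The sample loads the model in-process. If you'd prefer LangChain to talk to the running API container instead, one option is a small custom LLM wrapper. This is a hypothetical sketch, not code from the repo, and it reuses the same request-shape assumptions as the client above:

```python
import requests
from typing import Any, List, Optional
from langchain.llms.base import LLM

class ExllamaApiLLM(LLM):
    """Hypothetical wrapper that forwards prompts to the exllama API container."""

    endpoint: str = "http://localhost:6010/infer_precise"

    @property
    def _llm_type(self) -> str:
        return "exllama-api"

    def _call(self, prompt: str, stop: Optional[List[str]] = None, **kwargs: Any) -> str:
        # Same assumed form-field request shape as the client sketch above.
        resp = requests.post(self.endpoint, data={"prompt": prompt}, timeout=300)
        resp.raise_for_status()
        return resp.text

llm = ExllamaApiLLM()
print(llm("Write a haiku about GPUs."))
```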
Of course, you can run the existing exllama samples, such as the web UI, from this container without any extra work!
Run other make commands such as `make build_docker_web` followed by `make launch_docker_web`, or `make launch_text_bot`. Before doing this, edit `.env` under the exllama folder.