The inference-manager manages inference runtimes (e.g., vLLM and Ollama) in containers, loads models, and processes requests.
Please see inference_request_flow.md for details on how an inference request flows through the system.
Requirements: Docker, kind, Helm, and make (used by the commands below).
Run the following command:
```bash
make setup-all
```

Tip:

- Running just `make helm-reapply-inference-server` or `make helm-reapply-inference-engine` rebuilds the inference-manager container images, deploys them using the local Helm chart, and restarts the containers.
- You can configure parameters in `.values.yaml`.
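Once the setup finishes, it can be handy to confirm that the inference-manager components came up. A minimal check; the `llmariner` namespace is an assumption and may differ in your environment:

```bash
# Confirm the kind cluster nodes are ready.
kubectl get nodes

# List the inference-manager pods. The "llmariner" namespace is an
# assumption; adjust it (or use --all-namespaces) to match your deployment.
kubectl get pods -n llmariner | grep inference-manager
```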
To run vLLM on an ARM CPU (macOS), you'll need to build a vLLM CPU image yourself:
```bash
git clone https://github.com/vllm-project/vllm.git
cd vllm
docker build -f Dockerfile.arm -t vllm-cpu-env --shm-size=4g .
kind load docker-image vllm-cpu-env:latest
```

Then, run `make` with the `RUNTIME` option:
```bash
make setup-all RUNTIME=vllm
```

Note:
See vLLM - ARM installation for details.
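Optionally, you can confirm that the image was loaded into the kind node. A quick check, assuming the default node name `kind-control-plane`:

```bash
# Show the kind nodes in the cluster.
kind get nodes

# List images known to containerd on the node. "kind-control-plane" is the
# default node name and is an assumption here; adjust to your cluster.
docker exec -it kind-control-plane crictl images | grep vllm-cpu-env
```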
You can send a chat completion request with `curl`:
```bash
curl --request POST http://localhost:8080/v1/chat/completions -d '{
  "model": "google-gemma-2b-it-q4_0",
  "messages": [{"role": "user", "content": "hello"}]
}'
```

Or with `llma`:
```bash
export LLMARINER_API_KEY=dummy
llma chat completions create \
  --model google-gemma-2b-it-q4_0 \
  --role system \
  --completion 'hi'
```
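Since the endpoint is OpenAI-compatible, you can also ask for a streamed response. A sketch with `curl`; streaming support for this model is assumed:

```bash
curl --request POST http://localhost:8080/v1/chat/completions -d '{
  "model": "google-gemma-2b-it-q4_0",
  "messages": [{"role": "user", "content": "hello"}],
  "stream": true
}'
```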