ialacol (pronounced "localai") is a lightweight drop-in replacement for the OpenAI API.
It is an OpenAI API-compatible wrapper around ctransformers, supporting GGML/GPTQ models with optional CUDA/Metal acceleration.
ialacol is inspired by other similar projects like LocalAI, privateGPT, local.ai, llama-cpp-python, closedai, and mlc-llm, with a specific focus on Kubernetes deployment.
- Compatibility with OpenAI APIs, allowing you to use any framework built on top of the OpenAI APIs, such as langchain.
- Lightweight, easy deployment on Kubernetes clusters with a 1-click Helm installation.
- Streaming first! For better UX.
- Optional CUDA acceleration.
See Recipes below for deployment instructions.
- LLaMa 2 variants
- OpenLLaMA variants
- StarCoder variants
- WizardCoder
- StarChat variants
- MPT-7B
- MPT-30B
- Falcon
And all LLMs supported by ctransformers.
- Containerized AI before Apocalypse 🐳🤖
- Deploy Llama 2 AI on Kubernetes, Now
- Cloud Native Workflow for Private MPT-30B AI Apps
- Offline AI 🤖 on Github Actions 🙅♂️💰
ialacol offers first-class support for Kubernetes, which means you can automate and configure everything, compared to running without it.
To quickly get started with ialacol on Kubernetes, follow the steps below:
helm repo add ialacol https://chenhunghan.github.io/ialacol
helm repo update
helm install llama-2-7b-chat ialacol/ialacol
By default, this deploys Meta's Llama 2 Chat model quantized by TheBloke.
Port-forward the service:
kubectl port-forward svc/llama-2-7b-chat 8000:8000
Chat with the default model llama-2-7b-chat.ggmlv3.q4_0.bin using curl:
curl -X POST \
-H 'Content-Type: application/json' \
-d '{ "messages": [{"role": "user", "content": "How are you?"}], "model": "llama-2-7b-chat.ggmlv3.q4_0.bin", "stream": false}' \
http://localhost:8000/v1/chat/completions
Alternatively, use OpenAI's client library (see more examples in the examples/openai folder).
openai -k "sk-fake" \
-b http://localhost:8000/v1 -vvvvv \
api chat_completions.create -m llama-2-7b-chat.ggmlv3.q4_0.bin \
-g user "Hello world!"
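Because the endpoint speaks the OpenAI wire format, plain HTTP also works from Python. A minimal standard-library sketch (assuming the default model file and the port-forward above):

```python
# Minimal sketch: call the port-forwarded ialacol server from Python
# using only the standard library. The model filename and port match
# the defaults above; adjust them if you changed the deployment.
import json
import urllib.request

API_BASE = "http://localhost:8000/v1"
MODEL = "llama-2-7b-chat.ggmlv3.q4_0.bin"

def build_chat_request(content: str, stream: bool = False) -> dict:
    """Build an OpenAI-style chat completion payload."""
    return {
        "model": MODEL,
        "messages": [{"role": "user", "content": content}],
        "stream": stream,
    }

def chat(content: str) -> str:
    """POST the payload to /v1/chat/completions and return the reply text."""
    request = urllib.request.Request(
        f"{API_BASE}/chat/completions",
        data=json.dumps(build_chat_request(content)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]

# With the service port-forwarded: chat("How are you?")
```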
There is a container image hosted on ghcr.io (with CUDA11, CUDA12, Metal, and GPTQ variants).
docker run --rm -it -p 8000:8000 \
-e DEFAULT_MODEL_HG_REPO_ID="TheBloke/Llama-2-7B-Chat-GGML" \
-e DEFAULT_MODEL_FILE="llama-2-7b-chat.ggmlv3.q4_0.bin" \
ghcr.io/chenhunghan/ialacol:latest
For developers/contributors
Build image
docker build --file ./Dockerfile -t ialacol .
Run container
export DEFAULT_MODEL_HG_REPO_ID="TheBloke/orca_mini_3B-GGML"
export DEFAULT_MODEL_FILE="orca-mini-3b.ggmlv3.q4_0.bin"
docker run --rm -it -p 8000:8000 \
-e DEFAULT_MODEL_HG_REPO_ID=$DEFAULT_MODEL_HG_REPO_ID \
-e DEFAULT_MODEL_FILE=$DEFAULT_MODEL_FILE ialacol
To enable GPU/CUDA acceleration, you need to use the container image built for GPU and add the GPU_LAYERS environment variable. The right GPU_LAYERS value is determined by the size of your GPU memory; see the PR/discussion in llama.cpp to find the best value.
For CUDA 11, set deployment.image=ghcr.io/chenhunghan/ialacol-cuda11:latest; for CUDA 12, set deployment.image=ghcr.io/chenhunghan/ialacol-cuda12:latest. In both cases, deployment.env.GPU_LAYERS is the number of layers to offload to the GPU.
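For illustration, the two settings above could be collected into a small values file. This is a sketch showing only the GPU-related fields (the filename and the GPU_LAYERS value of 40 are illustrative; the real example files under examples/values/ also configure the model):

```yaml
# cuda12-values.yaml (sketch) -- GPU-related fields only;
# the example files under examples/values/ set the model too.
deployment:
  image: ghcr.io/chenhunghan/ialacol-cuda12:latest
  env:
    GPU_LAYERS: 40  # number of layers to offload; tune for your GPU memory
```

Pass it to the chart with `helm install my-release ialacol/ialacol -f cuda12-values.yaml`.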
Only llama, falcon, mpt and gpt_bigcode (StarCoder/StarChat) models support CUDA.
helm install llama2-7b-chat-cuda12 ialacol/ialacol -f examples/values/llama2-7b-chat-cuda12.yaml
Deploys the llama2 7b model with 40 layers offloaded to the GPU. Inference is accelerated by CUDA 12.
helm install starcoderplus-guanaco-cuda12 ialacol/ialacol -f examples/values/starcoderplus-guanaco-cuda12.yaml
Deploys the Starcoderplus-Guanaco-GPT4-15B-V1.0 model with 40 layers offloaded to the GPU. Inference is accelerated by CUDA 12.
If you see CUDA driver version is insufficient for CUDA runtime version when making a request, you are likely using an NVIDIA driver that is not compatible with the CUDA version. Upgrade the driver manually on the node (see here if you are using CUDA11 + AMI), or try a different CUDA version.
To enable Metal support, use the ialacol-metal image built for Metal by setting deployment.image=ghcr.io/chenhunghan/ialacol-metal:latest.
For example
helm install llama2-7b-chat-metal ialacol/ialacol -f examples/values/llama2-7b-chat-metal.yaml
To use GPTQ, you must set:
deployment.image=ghcr.io/chenhunghan/ialacol-gptq:latest
deployment.env.MODEL_TYPE=gptq
For example
helm install llama2-7b-chat-gptq ialacol/ialacol -f examples/values/llama2-7b-chat-gptq.yaml
kubectl port-forward svc/llama2-7b-chat-gptq 8000:8000
openai -k "sk-fake" -b http://localhost:8000/v1 -vvvvv api chat_completions.create -m gptq_model-4bit-128g.safetensors -g user "Hello world!"
LLMs are known to be sensitive to sampling parameters: a higher temperature leads to more "randomness", making the LLM more "creative"; top_p and top_k also contribute to that "randomness". If you want to make the LLM creative:
curl -X POST \
-H 'Content-Type: application/json' \
-d '{ "messages": [{"role": "user", "content": "Tell me a story."}], "model": "llama-2-7b-chat.ggmlv3.q4_0.bin", "stream": false, "temperature": 2, "top_p": 1.0, "top_k": 0 }' \
http://localhost:8000/v1/chat/completions
If you want the LLM to be more consistent and generate the same result for the same input:
curl -X POST \
-H 'Content-Type: application/json' \
-d '{ "messages": [{"role": "user", "content": "Tell me a story."}], "model": "llama-2-7b-chat.ggmlv3.q4_0.bin", "stream": false, "temperature": 0.1, "top_p": 0.1, "top_k": 40 }' \
http://localhost:8000/v1/chat/completions
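The two requests above differ only in the sampling parameters. As an illustrative sketch, they can be bundled as presets in Python (the preset names are made up for this example; only temperature, top_p and top_k are actual request parameters):

```python
# Sampling presets for OpenAI-style chat completion requests.
# The names CREATIVE/DETERMINISTIC are illustrative, not part of the API;
# only temperature, top_p and top_k are real request parameters.
CREATIVE = {"temperature": 2.0, "top_p": 1.0, "top_k": 0}
DETERMINISTIC = {"temperature": 0.1, "top_p": 0.1, "top_k": 40}

def chat_payload(content, model="llama-2-7b-chat.ggmlv3.q4_0.bin", **sampling):
    """Merge sampling parameters into a chat completion request body."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": content}],
        "stream": False,
        **sampling,
    }

# chat_payload("Tell me a story.", **DETERMINISTIC) yields the second
# curl body above; POST it to /v1/chat/completions as before.
```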
- Support starcoder model type via ctransformers
- Mimic the rest of the OpenAI API, including GET /models and POST /completions
- GPU acceleration (CUDA/Metal)
- Support POST /embeddings backed by Hugging Face Apache-2.0 embedding models such as Sentence Transformers and hkunlp/instructor
- Support Apache-2.0 fastchat-t5-3b
- Support more Apache-2.0 models such as codet5p and others listed here
Deploy Meta's Llama 2 Chat model quantized by TheBloke.
7B Chat
helm repo add ialacol https://chenhunghan.github.io/ialacol
helm repo update
helm install llama2-7b-chat ialacol/ialacol -f examples/values/llama2-7b-chat.yaml
13B Chat
helm repo add ialacol https://chenhunghan.github.io/ialacol
helm repo update
helm install llama2-13b-chat ialacol/ialacol -f examples/values/llama2-13b-chat.yaml
70B Chat
helm repo add ialacol https://chenhunghan.github.io/ialacol
helm repo update
helm install llama2-70b-chat ialacol/ialacol -f examples/values/llama2-70b-chat.yaml
Deploy OpenLLaMA 7B model quantized by rustformers.
ℹ️ This is a base model, likely only useful for text completion.
helm repo add ialacol https://chenhunghan.github.io/ialacol
helm repo update
helm install openllama-7b ialacol/ialacol -f examples/values/openllama-7b.yaml
Deploy OpenLLaMA 13B Open Instruct model quantized by TheBloke.
helm repo add ialacol https://chenhunghan.github.io/ialacol
helm repo update
helm install openllama-13b-instruct ialacol/ialacol -f examples/values/openllama-13b-instruct.yaml
Deploy MosaicML's MPT-7B model quantized by rustformers. ℹ️ This is a base model, likely only useful for text completion.
helm repo add ialacol https://chenhunghan.github.io/ialacol
helm repo update
helm install mpt-7b ialacol/ialacol -f examples/values/mpt-7b.yaml
Deploy MosaicML's MPT-30B Chat model quantized by TheBloke.
helm repo add ialacol https://chenhunghan.github.io/ialacol
helm repo update
helm install mpt-30b-chat ialacol/ialacol -f examples/values/mpt-30b-chat.yaml
Deploy Uncensored Falcon 7B model quantized by TheBloke.
helm repo add ialacol https://chenhunghan.github.io/ialacol
helm repo update
helm install falcon-7b ialacol/ialacol -f examples/values/falcon-7b.yaml
Deploy Uncensored Falcon 40B model quantized by TheBloke.
helm repo add ialacol https://chenhunghan.github.io/ialacol
helm repo update
helm install falcon-40b ialacol/ialacol -f examples/values/falcon-40b.yaml
Deploy the starchat-beta model quantized by TheBloke.
helm repo add ialacol https://chenhunghan.github.io/ialacol
helm repo update
helm install starchat-beta ialacol/ialacol -f examples/values/starchat-beta.yaml
Deploy the WizardCoder model quantized by TheBloke.
helm repo add ialacol https://chenhunghan.github.io/ialacol
helm repo update
helm install wizard-coder-15b ialacol/ialacol -f examples/values/wizard-coder-15b.yaml
Deploy the lightweight pythia-70m model with only 70 million parameters (~40MB), quantized by rustformers.
helm repo add ialacol https://chenhunghan.github.io/ialacol
helm repo update
helm install pythia70m ialacol/ialacol -f examples/values/pythia-70m.yaml
Deploy the RedPajama 3B model.
helm repo add ialacol https://chenhunghan.github.io/ialacol
helm repo update
helm install redpajama-3b ialacol/ialacol -f examples/values/redpajama-3b.yaml
Deploy the StableLM 7B model.
helm repo add ialacol https://chenhunghan.github.io/ialacol
helm repo update
helm install stablelm-7b ialacol/ialacol -f examples/values/stablelm-7b.yaml
python3 -m venv .venv
source .venv/bin/activate
python3 -m pip install -r requirements.txt
pip freeze > requirements.txt