ialacol (l-o-c-a-l-a-i)

🚧 being rewritten from Python to Rust/WebAssembly, see details #93

Introduction

ialacol (pronounced "localai") is a lightweight drop-in replacement for the OpenAI API.

It is an OpenAI API-compatible wrapper around ctransformers, supporting GGML/GPTQ with optional CUDA/Metal acceleration.

ialacol is inspired by other similar projects like LocalAI, privateGPT, local.ai, llama-cpp-python, closedai, and mlc-llm, with a specific focus on Kubernetes deployment.

Features

Compatible with the OpenAI APIs and with langchain.
Lightweight, easy deployment on Kubernetes clusters with a 1-click Helm installation.
Streaming first! For better UX.
Optional CUDA acceleration.
Compatible with the GitHub Copilot VSCode extension, see Copilot.
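
As a rough illustration of the streaming feature, the sketch below assumes ialacol streams chat completions in the same server-sent-events shape as the OpenAI API (`data: {json}` lines, ended by `data: [DONE]`). The helper name `iter_stream_content` and the sample lines are hypothetical and are not taken from the ialacol code base.

```python
import json
from typing import Iterable, Iterator


def iter_stream_content(lines: Iterable[str]) -> Iterator[str]:
    """Yield text deltas from OpenAI-style server-sent-event lines."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and SSE comments
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream sentinel used by the OpenAI API
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            text = choice.get("delta", {}).get("content")
            if text:
                yield text


# Hypothetical sample of what a streamed response might look like.
sample = [
    'data: {"choices": [{"delta": {"role": "assistant"}}]}',
    "",
    'data: {"choices": [{"delta": {"content": "Hello"}}]}',
    'data: {"choices": [{"delta": {"content": " world"}}]}',
    "data: [DONE]",
]
print("".join(iter_stream_content(sample)))
```

Because the function only consumes an iterable of lines, it can be fed from any HTTP client that exposes the response body line by line.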

Supported Models

See Receipts below for deployment instructions.

LLaMa 2 variants, including OpenLLaMA, Mistral, openchat_3.5 and zephyr.
StarCoder variants
WizardCoder
StarChat variants
MPT-7B
MPT-30B
Falcon

And all LLMs supported by ctransformers.

UI

ialacol does not have a UI; however, it is compatible with any web UI that supports the OpenAI API, for example chat-ui after PR #541 was merged.

Assuming ialacol is running on port 8000, you can configure chat-ui to use zephyr-7b-beta.Q4_K_M.gguf served by ialacol.

MODELS=`[
  {
      "name": "zephyr-7b-beta.Q4_K_M.gguf",
      "displayName": "Zephyr 7B β",
      "preprompt": "<|system|>\nYou are a friendly chatbot who always responds in the style of a pirate.</s>\n",
      "userMessageToken": "<|user|>\n",
      "userMessageEndToken": "</s>\n",
      "assistantMessageToken": "<|assistant|>\n",
      "assistantMessageEndToken": "\n",
      "parameters": {
        "temperature": 0.1,
        "top_p": 0.95,
        "repetition_penalty": 1.2,
        "top_k": 50,
        "max_new_tokens": 4096,
        "truncate": 999999
      },
      "endpoints" : [{
        "type": "openai",
        "baseURL": "http://localhost:8000/v1",
        "completion": "chat_completions"
      }]
  }
]`

The equivalent configuration for openchat_3.5.Q4_K_M.gguf:

MODELS=`[
  {
      "name": "openchat_3.5.Q4_K_M.gguf",
      "displayName": "OpenChat 3.5",
      "preprompt": "",
      "userMessageToken": "GPT4 User: ",
      "userMessageEndToken": "<|end_of_turn|>",
      "assistantMessageToken": "GPT4 Assistant: ",
      "assistantMessageEndToken": "<|end_of_turn|>",
      "parameters": {
        "temperature": 0.1,
        "top_p": 0.95,
        "repetition_penalty": 1.2,
        "top_k": 50,
        "max_new_tokens": 4096,
        "truncate": 999999,
        "stop": ["<|end_of_turn|>"]
      },
      "endpoints" : [{
        "type": "openai",
        "baseURL": "http://localhost:8000/v1",
        "completion": "chat_completions"
      }]
  }
]`

Quick Start

Kubernetes

ialacol offers first-class support for Kubernetes, which means you can automate and configure everything, unlike running without it.

To quickly get started with ialacol on Kubernetes, follow the steps below:

helm repo add ialacol https://chenhunghan.github.io/ialacol
helm repo update
helm install llama-2-7b-chat ialacol/ialacol

By default, this deploys Meta's Llama 2 Chat model quantized by TheBloke.

Port-forward

kubectl port-forward svc/llama-2-7b-chat 8000:8000

Chat with the default model llama-2-7b-chat.ggmlv3.q4_0.bin using curl

curl -X POST \
     -H 'Content-Type: application/json' \
     -d '{ "messages": [{"role": "user", "content": "How are you?"}], "model": "llama-2-7b-chat.ggmlv3.q4_0.bin", "stream": false}' \
     http://localhost:8000/v1/chat/completions

Alternatively, using OpenAI's client library (see more examples in the examples/openai folder).

openai -k "sk-fake" \
     -b http://localhost:8000/v1 -vvvvv \
     api chat_completions.create -m llama-2-7b-chat.ggmlv3.q4_0.bin \
     -g user "Hello world!"
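
The same request can be sketched in Python with only the standard library. This is a minimal sketch, assuming the port-forward above is active and that the response follows the OpenAI chat completion shape (`choices[0].message.content`); the helper names `build_chat_request` and `chat` are hypothetical.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"  # the port-forwarded service from above


def build_chat_request(content: str, model: str,
                       base_url: str = BASE_URL) -> urllib.request.Request:
    """Build (but do not send) a POST request for /chat/completions."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": content}],
        "stream": False,
    }
    return urllib.request.Request(
        url=f"{base_url}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(content: str, model: str) -> str:
    """Send the request and return the first choice's message text.

    Assumes an OpenAI-style response body; needs a running ialacol instance.
    """
    with urllib.request.urlopen(build_chat_request(content, model)) as resp:
        reply = json.load(resp)
    return reply["choices"][0]["message"]["content"]


# Example call (requires the server to be reachable):
# print(chat("How are you?", "llama-2-7b-chat.ggmlv3.q4_0.bin"))
```

Splitting request construction from sending keeps the payload easy to inspect without a server.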

Configuration

All configuration is done via environment variables.

Parameter	Description	Default	Example
`DEFAULT_MODEL_HG_REPO_ID`	The Hugging Face repo id to download the model	`None`	`TheBloke/orca_mini_3B-GGML`
`DEFAULT_MODEL_HG_REPO_REVISION`	The Hugging Face repo revision	`main`	`gptq-4bit-32g-actorder_True`
`DEFAULT_MODEL_FILE`	The file name to download from the repo, optional for GPTQ models	`None`	`orca-mini-3b.ggmlv3.q4_0.bin`
`MODEL_TYPE`	Model type to override the automatic model type detection	`None`	`gptq`, `gpt_bigcode`, `llama`, `mpt`, `replit`, `falcon`, `gpt_neox`, `gptj`
`LOGGING_LEVEL`	Logging level	`INFO`	`DEBUG`
`TOP_K`	top-k for sampling.	`40`	Integers
`TOP_P`	top-p for sampling.	`1.0`	Floats
`REPETITION_PENALTY`	rp for sampling.	`1.1`	Floats
`LAST_N_TOKENS`	The last n tokens for repetition penalty.	`1.1`	Integers
`SEED`	The seed for sampling.	`-1`	Integers
`BATCH_SIZE`	The batch size for evaluating tokens, only for GGUF/GGML models	`8`	Integers
`THREADS`	Number of threads, overriding the auto-detected value (CPU count / 2); set `1` for GPTQ models	`Auto`	Integers
`MAX_TOKENS`	The max number of tokens to generate	`512`	Integers
`STOP`	The token to stop the generation	`None`	`<
`CONTEXT_LENGTH`	Override the auto-detected context length	`512`	Integers
`GPU_LAYERS`	The number of layers to offload to the GPU	`0`	Integers
`TRUNCATE_PROMPT_LENGTH`	Truncate the prompt if set	`0`	Integers

Sampling parameters, including TOP_K, TOP_P, REPETITION_PENALTY, LAST_N_TOKENS, SEED, MAX_TOKENS and STOP, can be overridden per request via the request body, for example:

curl -X POST \
     -H 'Content-Type: application/json' \
     -d '{ "messages": [{"role": "user", "content": "Tell me a story."}], "model": "llama-2-7b-chat.ggmlv3.q4_0.bin", "stream": false, "temperature": "2", "top_p": "1.0", "top_k": "0" }' \
     http://localhost:8000/v1/chat/completions

This request uses temperature=2, top_p=1 and top_k=0.
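
The precedence described above (request body first, then environment variable, then default) can be sketched as follows. This is only an illustration, not ialacol's actual implementation: the helper `resolve_sampling` is hypothetical, and the body keys `repetition_penalty`, `seed` and `max_tokens` are assumed by analogy with `top_k` and `top_p` from the curl example.

```python
import os
from typing import Any, Dict, Mapping

# Environment variable -> (request-body key, type, documented default).
# Body key names other than top_k/top_p are assumptions.
SAMPLING = {
    "TOP_K": ("top_k", int, 40),
    "TOP_P": ("top_p", float, 1.0),
    "REPETITION_PENALTY": ("repetition_penalty", float, 1.1),
    "SEED": ("seed", int, -1),
    "MAX_TOKENS": ("max_tokens", int, 512),
}


def resolve_sampling(body: Mapping[str, Any],
                     env: Mapping[str, str] = os.environ) -> Dict[str, Any]:
    """Request body wins, then the environment, then the documented default."""
    resolved = {}
    for env_name, (key, cast, default) in SAMPLING.items():
        if body.get(key) is not None:
            resolved[key] = cast(body[key])
        elif env_name in env:
            resolved[key] = cast(env[env_name])
        else:
            resolved[key] = default
    return resolved


# top_k comes from the body, max_tokens from the environment, the rest from defaults.
print(resolve_sampling({"top_k": "0", "top_p": "1.0"},
                       env={"TOP_K": "50", "MAX_TOKENS": "256"}))
```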

Run in Container

Image from GitHub Registry

An image is hosted on ghcr.io (CUDA11, CUDA12, METAL and GPTQ variants are also available).

docker run --rm -it -p 8000:8000 \
     -e DEFAULT_MODEL_HG_REPO_ID="TheBloke/Llama-2-7B-Chat-GGML" \
     -e DEFAULT_MODEL_FILE="llama-2-7b-chat.ggmlv3.q4_0.bin" \
     ghcr.io/chenhunghan/ialacol:latest

From Source

For developers/contributors

Python

python3 -m venv .venv
source .venv/bin/activate
python3 -m pip install -r requirements.txt
DEFAULT_MODEL_HG_REPO_ID="TheBloke/stablecode-completion-alpha-3b-4k-GGML" DEFAULT_MODEL_FILE="stablecode-completion-alpha-3b-4k.ggmlv1.q4_0.bin" LOGGING_LEVEL="DEBUG" THREADS=4 uvicorn main:app --reload --host 0.0.0.0 --port 9999

Docker

Build image

docker build --file ./Dockerfile -t ialacol .

Run container

export DEFAULT_MODEL_HG_REPO_ID="TheBloke/orca_mini_3B-GGML"
export DEFAULT_MODEL_FILE="orca-mini-3b.ggmlv3.q4_0.bin"
docker run --rm -it -p 8000:8000 \
     -e DEFAULT_MODEL_HG_REPO_ID=$DEFAULT_MODEL_HG_REPO_ID \
     -e DEFAULT_MODEL_FILE=$DEFAULT_MODEL_FILE ialacol

GPU Acceleration

To enable GPU/CUDA acceleration, use the container image built for GPU and set the GPU_LAYERS environment variable. The right value for GPU_LAYERS is determined by the size of your GPU memory. See the PR/discussion in llama.cpp to find the best value.

CUDA 11

deployment.image = ghcr.io/chenhunghan/ialacol-cuda11:latest
deployment.env.GPU_LAYERS is the number of layers to offload to the GPU.

CUDA 12

deployment.image = ghcr.io/chenhunghan/ialacol-cuda12:latest
deployment.env.GPU_LAYERS is the number of layers to offload to the GPU.

Only llama, falcon, mpt and gpt_bigcode (StarCoder/StarChat) support CUDA.
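
Since GPU_LAYERS depends on GPU memory, a back-of-the-envelope estimate can serve as a starting point before tuning. The sketch below is a rough heuristic, not a method taken from ialacol or llama.cpp: it assumes every layer takes an equal share of the model file and reserves a fixed amount of memory for the KV cache and scratch buffers. The function name and all numbers are hypothetical.

```python
def estimate_gpu_layers(model_bytes: int, n_layers: int, free_vram_bytes: int,
                        reserve_bytes: int = 1 * 1024**3) -> int:
    """Rough guess at how many layers fit in VRAM.

    Assumes every layer takes an equal share of the model file and keeps
    `reserve_bytes` free for the KV cache and scratch buffers.
    """
    if n_layers <= 0 or model_bytes <= 0:
        raise ValueError("model_bytes and n_layers must be positive")
    per_layer = model_bytes / n_layers
    usable = max(free_vram_bytes - reserve_bytes, 0)
    return min(n_layers, int(usable // per_layer))


GIB = 1024**3
# Hypothetical numbers: a 4 GiB quantized file with 32 layers.
print(estimate_gpu_layers(4 * GIB, 32, free_vram_bytes=3 * GIB))  # partial offload
print(estimate_gpu_layers(4 * GIB, 32, free_vram_bytes=8 * GIB))  # everything fits
```

Treat the result as an initial guess; the practical limit also depends on context length and batch size.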

Llama with CUDA12

helm install llama2-7b-chat-cuda12 ialacol/ialacol -f examples/values/llama2-7b-chat-cuda12.yaml

Deploys the Llama 2 7B model with 40 layers offloaded to the GPU. Inference is accelerated by CUDA 12.

StarCoderPlus with CUDA12

helm install starcoderplus-guanaco-cuda12 ialacol/ialacol -f examples/values/starcoderplus-guanaco-cuda12.yaml

Deploys the Starcoderplus-Guanaco-GPT4-15B-V1.0 model with 40 layers offloaded to the GPU. Inference is accelerated by CUDA 12.

CUDA Driver Issues

If you see "CUDA driver version is insufficient for CUDA runtime version" when making a request, you are likely using an NVIDIA driver that is not compatible with the CUDA version.

Upgrade the driver manually on the node (see here if you are using CUDA11 + AMI), or try a different version of CUDA.

Metal

To enable Metal support, use the ialacol-metal image, which is built for Metal.

deployment.image = ghcr.io/chenhunghan/ialacol-metal:latest

For example

helm install llama2-7b-chat-metal ialacol/ialacol -f examples/values/llama2-7b-chat-metal.yaml

GPTQ

To use GPTQ, you must set the following values:

deployment.image = ghcr.io/chenhunghan/ialacol-gptq:latest
deployment.env.MODEL_TYPE = gptq

For example

helm install llama2-7b-chat-gptq ialacol/ialacol -f examples/values/llama2-7b-chat-gptq.yaml

kubectl port-forward svc/llama2-7b-chat-gptq 8000:8000
openai -k "sk-fake" -b http://localhost:8000/v1 -vvvvv api chat_completions.create -m gptq_model-4bit-128g.safetensors -g user "Hello world!"

Tips

Copilot

ialacol can be used with the Copilot client, because GitHub Copilot's API is almost identical to the OpenAI completion API.

However, a few things need to be kept in mind:

The Copilot client sends a lengthy prompt that includes all the related context for code completion (see copilot-explorer), which puts a heavy load on the server. If you are running ialacol locally, set the TRUNCATE_PROMPT_LENGTH environment variable to truncate the prompt from the beginning and reduce the workload.
Copilot sends requests in parallel. To increase throughput, you probably need a queue such as text-inference-batcher.
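
To make the truncation idea concrete, here is a minimal sketch. It assumes that "from the beginning" means the oldest tokens are dropped so that the text nearest the cursor survives, and that a value of 0 disables truncation. Whitespace splitting stands in for the model's real tokenizer, and `truncate_from_beginning` is a hypothetical helper, not ialacol's code.

```python
from typing import List


def truncate_from_beginning(tokens: List[str], limit: int) -> List[str]:
    """Drop the oldest tokens so that at most `limit` remain.

    A limit of 0 (or less) is treated as "do not truncate".
    """
    if limit <= 0 or len(tokens) <= limit:
        return tokens
    return tokens[-limit:]


# Whitespace splitting stands in for a real tokenizer here.
prompt = "import os def main ( ) : print ( os . getcwd ( ) )".split()
print(" ".join(truncate_from_beginning(prompt, 5)))
```

Keeping the tail rather than the head suits code completion, where the most relevant context usually sits right before the cursor.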

Start two instances of ialacol:

gh repo clone chenhunghan/ialacol && cd ialacol && python3 -m venv .venv && source .venv/bin/activate && python3 -m pip install -r requirements.txt
# export the settings so that both uvicorn processes inherit them
export LOGGING_LEVEL="DEBUG"
export THREADS=2
export DEFAULT_MODEL_HG_REPO_ID="TheBloke/stablecode-completion-alpha-3b-4k-GGML"
export DEFAULT_MODEL_FILE="stablecode-completion-alpha-3b-4k.ggmlv1.q4_0.bin"
export TRUNCATE_PROMPT_LENGTH=100 # optional
# run the first instance in the background (or use a second terminal)
uvicorn main:app --host 0.0.0.0 --port 9998 &
uvicorn main:app --host 0.0.0.0 --port 9999

Start tib, pointing to upstream ialacol instances.

gh repo clone ialacol/text-inference-batcher && cd text-inference-batcher && npm install
UPSTREAMS="http://localhost:9998,http://localhost:9999" npm start
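
For intuition about spreading parallel requests over several upstreams, the sketch below uses plain round-robin. This is a deliberately simpler technique chosen for illustration and is not a description of how tib schedules requests; `parse_upstreams` and `round_robin` are hypothetical helpers.

```python
import itertools
from typing import Iterator, List


def parse_upstreams(value: str) -> List[str]:
    """Split a comma-separated UPSTREAMS string into clean URLs."""
    return [u.strip().rstrip("/") for u in value.split(",") if u.strip()]


def round_robin(upstreams: List[str]) -> Iterator[str]:
    """Cycle through the upstreams forever, one per incoming request."""
    if not upstreams:
        raise ValueError("at least one upstream is required")
    return itertools.cycle(upstreams)


picker = round_robin(parse_upstreams("http://localhost:9998,http://localhost:9999"))
for _ in range(3):
    print(next(picker))
```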

Configure VSCode GitHub Copilot to use tib.

"github.copilot.advanced": {
     "debug.overrideEngine": "stablecode-completion-alpha-3b-4k.ggmlv1.q4_0.bin",
     "debug.testOverrideProxyUrl": "http://localhost:8000",
     "debug.overrideProxyUrl": "http://localhost:8000"
}

Creative vs. Conservative

LLMs are known to be sensitive to sampling parameters. A higher temperature leads to more "randomness", so the LLM becomes more "creative"; top_p and top_k also contribute to the "randomness".

To make the LLM more creative:

curl -X POST \
     -H 'Content-Type: application/json' \
     -d '{ "messages": [{"role": "user", "content": "Tell me a story."}], "model": "llama-2-7b-chat.ggmlv3.q4_0.bin", "stream": false, "temperature": "2", "top_p": "1.0", "top_k": "0" }' \
     http://localhost:8000/v1/chat/completions

To make the LLM more consistent, so that it generates the same result for the same input:

curl -X POST \
     -H 'Content-Type: application/json' \
     -d '{ "messages": [{"role": "user", "content": "Tell me a story."}], "model": "llama-2-7b-chat.ggmlv3.q4_0.bin", "stream": false, "temperature": "0.1", "top_p": "0.1", "top_k": "40" }' \
     http://localhost:8000/v1/chat/completions
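
The effect of these three parameters can be sketched with a toy sampler. This is a generic illustration of temperature scaling, top-k and top-p (nucleus) filtering, not the sampler used by ctransformers; the logits and the function name `sample_token` are hypothetical.

```python
import math
import random
from typing import Dict, Optional


def sample_token(logits: Dict[str, float], temperature: float = 1.0,
                 top_k: int = 0, top_p: float = 1.0,
                 rng: Optional[random.Random] = None) -> str:
    """Sample one token after temperature scaling, top-k and top-p filtering."""
    rng = rng or random.Random()
    # Temperature: divide logits before softmax; lower values sharpen the distribution.
    scaled = {t: l / max(temperature, 1e-6) for t, l in logits.items()}
    peak = max(scaled.values())
    weights = {t: math.exp(l - peak) for t, l in scaled.items()}
    total = sum(weights.values())
    ranked = sorted(((w / total, t) for t, w in weights.items()), reverse=True)
    # top_k: keep only the k most likely tokens (0 disables the filter).
    if top_k > 0:
        ranked = ranked[:top_k]
    # top_p: keep the smallest prefix whose cumulative probability reaches p.
    kept, cumulative = [], 0.0
    for prob, token in ranked:
        kept.append((prob, token))
        cumulative += prob
        if cumulative >= top_p:
            break
    probs, tokens = zip(*kept)
    return rng.choices(tokens, weights=probs, k=1)[0]


logits = {"the": 2.0, "a": 1.5, "dragon": 0.2}  # hypothetical next-token scores
rng = random.Random(0)
# "Creative" settings: flatter distribution, no filtering.
print([sample_token(logits, temperature=2.0, rng=rng) for _ in range(5)])
# "Conservative" settings: only the most likely token survives the top_p cut.
print([sample_token(logits, temperature=0.1, top_p=0.1, top_k=40, rng=rng) for _ in range(5)])
```

With the conservative settings the nucleus collapses to a single token, which is why repeated requests tend to return the same text.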

Roadmap

Support starcoder model type via ctransformers, including:
- StarChat https://huggingface.co/TheBloke/starchat-beta-GGML
- StarCoder https://huggingface.co/TheBloke/starcoder-GGML
- StarCoderPlus https://huggingface.co/TheBloke/starcoderplus-GGML
Mimic the rest of the OpenAI API, including GET /models and POST /completions
GPU acceleration (CUDA/METAL)
Support POST /embeddings backed by huggingface Apache-2.0 embedding models such as Sentence Transformers and hkunlp/instructor
Support Apache-2.0 fastchat-t5-3b
Support more Apache-2.0 models such as codet5p and others listed here

Receipts

Llama-2

Deploy Meta's Llama 2 Chat model quantized by TheBloke.

7B Chat

helm repo add ialacol https://chenhunghan.github.io/ialacol
helm repo update
helm install llama2-7b-chat ialacol/ialacol -f examples/values/llama2-7b-chat.yaml

13B Chat

helm repo add ialacol https://chenhunghan.github.io/ialacol
helm repo update
helm install llama2-13b-chat ialacol/ialacol -f examples/values/llama2-13b-chat.yaml

70B Chat

helm repo add ialacol https://chenhunghan.github.io/ialacol
helm repo update
helm install llama2-70b-chat ialacol/ialacol -f examples/values/llama2-70b-chat.yaml

OpenLM Research's OpenLLaMA Models

Deploy OpenLLaMA 7B model quantized by rustformers.

ℹ️ This is a base model, likely only useful for text completion.

helm repo add ialacol https://chenhunghan.github.io/ialacol
helm repo update
helm install openllama-7b ialacol/ialacol -f examples/values/openllama-7b.yaml

VMWare's OpenLlama 13B Open Instruct

Deploy OpenLLaMA 13B Open Instruct model quantized by TheBloke.

helm repo add ialacol https://chenhunghan.github.io/ialacol
helm repo update
helm install openllama-13b-instruct ialacol/ialacol -f examples/values/openllama-13b-instruct.yaml

Mosaic's MPT Models

Deploy MosaicML's MPT-7B model quantized by rustformers. ℹ️ This is a base model, likely only useful for text completion.

helm repo add ialacol https://chenhunghan.github.io/ialacol
helm repo update
helm install mpt-7b ialacol/ialacol -f examples/values/mpt-7b.yaml

Deploy MosaicML's MPT-30B Chat model quantized by TheBloke.

helm repo add ialacol https://chenhunghan.github.io/ialacol
helm repo update
helm install mpt-30b-chat ialacol/ialacol -f examples/values/mpt-30b-chat.yaml

Falcon Models

Deploy Uncensored Falcon 7B model quantized by TheBloke.

helm repo add ialacol https://chenhunghan.github.io/ialacol
helm repo update
helm install falcon-7b ialacol/ialacol -f examples/values/falcon-7b.yaml

Deploy Uncensored Falcon 40B model quantized by TheBloke.

helm repo add ialacol https://chenhunghan.github.io/ialacol
helm repo update
helm install falcon-40b ialacol/ialacol -f examples/values/falcon-40b.yaml

StarCoder Models (starcoder, starchat, starcoderplus, WizardCoder)

Deploy starchat-beta model quantized by TheBloke.

helm repo add ialacol https://chenhunghan.github.io/ialacol
helm repo update
helm install starchat-beta ialacol/ialacol -f examples/values/starchat-beta.yaml

Deploy WizardCoder model quantized by TheBloke.

helm repo add ialacol https://chenhunghan.github.io/ialacol
helm repo update
helm install wizard-coder-15b ialacol/ialacol -f examples/values/wizard-coder-15b.yaml

Pythia Models

Deploy the lightweight pythia-70m model, with only 70 million parameters (~40MB), quantized by rustformers.

helm repo add ialacol https://chenhunghan.github.io/ialacol
helm repo update
helm install pythia70m ialacol/ialacol -f examples/values/pythia-70m.yaml

RedPajama Models

Deploy RedPajama 3B model

helm repo add ialacol https://chenhunghan.github.io/ialacol
helm repo update
helm install redpajama-3b ialacol/ialacol -f examples/values/redpajama-3b.yaml

StableLM Models

Deploy StableLM 7B model

helm repo add ialacol https://chenhunghan.github.io/ialacol
helm repo update
helm install stablelm-7b ialacol/ialacol -f examples/values/stablelm-7b.yaml

Development

python3 -m venv .venv
source .venv/bin/activate
python3 -m pip install -r requirements.txt
pip freeze > requirements.txt