# LLM API
This application runs LLMs (Large Language Models) in Docker containers. It is built in a generic way so that it can be reused for multiple types of models.

The main motivation for starting this project was to be able to use different LLMs, running on a local machine or a remote server, with LangChain via langchain-llm-api.
Tested with the following models:
- Llama 7b - ggml
- Llama 13b - ggml
- Llama 30b - ggml
- Alpaca 7b - ggml
- Alpaca 13b - ggml
- Alpaca 30b - ggml
- Vicuna 13b - ggml
- Koala 7b - ggml
- Vicuna GPTQ 7B-4bit-128g
- Vicuna GPTQ 13B-4bit-128g
- Koala GPTQ 7B-4bit-128g
- wizardLM GPTQ 7B-4bit-128g
Contributions to support more models are welcome.
## Roadmap
- Write an implementation for Alpaca
- Write an implementation for Llama
- Write an implementation for Vicuna
- Support GPTQ-for-LLaMa
- LoRA support
- Hugging Face pipeline
- Support OpenAI
- Support RWKV-LM
## Usage
To run this API on a local machine, a running Docker engine is needed.
Run using Docker: create a `config.yaml` file with the configs described below, then run:

```
docker run -v $PWD/models/:/models:rw -v $PWD/config.yaml:/llm-api/config.yaml:ro -p 8000:8000 --ulimit memlock=16000000000 1b5d/llm-api
```
Or use the `docker-compose.yaml` in this repo and run using Compose:

```
docker compose up
```
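For reference, a minimal Compose file equivalent to the `docker run` command above might look like the sketch below; check the `docker-compose.yaml` shipped in this repo for the authoritative version.

```yaml
services:
  llm-api:
    image: 1b5d/llm-api
    ports:
      - "8000:8000"
    volumes:
      - ./models:/models:rw          # model files, read-write
      - ./config.yaml:/llm-api/config.yaml:ro  # config, read-only
    ulimits:
      memlock: 16000000000
```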
When running for the first time, the app downloads the model from Hugging Face based on the configuration in `setup_params` and names the local model file accordingly; on later runs it looks up the same local file and loads it into memory.
## Config
To configure the application, edit `config.yaml`, which is mounted into the Docker container. The config file looks like this:
```yaml
models_dir: /models     # dir inside the container
model_family: alpaca
setup_params:
  key: value
model_params:
  key: value
```
`setup_params` and `model_params` are model specific; see below for model-specific configs.
You can override any of the above-mentioned configs using environment variables prefixed with `LLM_API_`, for example: `LLM_API_MODELS_DIR=/models`.
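For instance, one way to pass such an override when running with Docker is the `-e` flag:

```
docker run -e LLM_API_MODELS_DIR=/models -v $PWD/models/:/models:rw -v $PWD/config.yaml:/llm-api/config.yaml:ro -p 8000:8000 1b5d/llm-api
```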
## Endpoints
In general, all LLMs expose the same set of endpoints:
`POST /generate`

```json
{
    "prompt": "What is the capital of France?",
    "params": {
        ...
    }
}
```
`POST /agenerate`

```json
{
    "prompt": "What is the capital of France?",
    "params": {
        ...
    }
}
```
`POST /embeddings`

```json
{
    "text": "What is the capital of France?"
}
```
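For example, assuming the API is listening on localhost:8000 as in the Docker commands above, the embeddings endpoint can be called like this:

```
curl --location 'localhost:8000/embeddings' \
--header 'Content-Type: application/json' \
--data '{
    "text": "What is the capital of France?"
}'
```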
## Llama on CPU - using llama.cpp
Llama, and models based on it such as Alpaca and Vicuna, are intended only for academic research; any commercial use is prohibited. This project doesn't provide any links to download these models.
You can configure the model in a local `config.yaml` file. Here is an example:
```yaml
models_dir: /models     # dir inside the container
model_family: alpaca
setup_params:
  repo_id: user/repo_id
  filename: ggml-model-q4_0.bin
  convert: false
  migrate: false
model_params:
  ctx_size: 2000
  seed: -1
  n_threads: 8
  n_batch: 2048
  n_parts: -1
  last_n_tokens_size: 16
```
Set `repo_id` and `filename` to a Hugging Face repo where the model is hosted, and let the application download it for you.
`convert` refers to https://github.com/ggerganov/llama.cpp/blob/master/convert-unversioned-ggml-to-ggml.py; set this to `true` when you need to use an older model which needs to be converted.

`migrate` refers to https://github.com/ggerganov/llama.cpp/blob/master/migrate-ggml-2023-03-30-pr613.py; set this to `true` when you need to apply this script to an older model which needs to be migrated.
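For example, a hypothetical older GGML checkpoint that needs both steps could be configured like this (`repo_id` and `filename` are placeholders):

```yaml
setup_params:
  repo_id: user/old-model-repo       # hypothetical repo
  filename: ggml-old-model-q4_0.bin  # hypothetical older GGML file
  convert: true   # convert the unversioned GGML file first
  migrate: true   # then apply the migration script
```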
The following example shows the different params you can send to the Alpaca `generate` and `agenerate` endpoints:
`POST /generate`

```
curl --location 'localhost:8000/generate' \
--header 'Content-Type: application/json' \
--data '{
    "prompt": "What is the capital of France?",
    "params": {
        "n_predict": 300,
        "temp": 0.1,
        "top_k": 40,
        "top_p": 0.95,
        "stop": ["\n"],
        "repeat_penalty": 1.3
    }
}'
```
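The `agenerate` endpoint accepts the same payload, for example:

`POST /agenerate`

```
curl --location 'localhost:8000/agenerate' \
--header 'Content-Type: application/json' \
--data '{
    "prompt": "What is the capital of France?",
    "params": {
        "n_predict": 300,
        "temp": 0.1
    }
}'
```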
## Llama / Alpaca on GPU - using GPTQ-for-LLaMa (beta)
Note: According to nvidia-docker, you might want to install the NVIDIA Driver on your host machine. Verify that your NVIDIA environment is properly set up by running:
```
docker run --rm --gpus all nvidia/cuda:11.7.1-base-ubuntu20.04 nvidia-smi
```
You should see a table showing the current NVIDIA driver version and some other info:
```
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.30.02              Driver Version: 530.30.02    CUDA Version: 11.7     |
|-----------------------------------------+----------------------+----------------------+
...
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+
```
You can also run the Llama model using GPTQ-for-LLaMa 4-bit quantization; a Docker image built specifically for that purpose is available. Use `1b5d/llm-api:0.0.4-gptq-llama-triton` instead of the default image.
A separate docker-compose file is also available to run this mode:

```
docker compose -f docker-compose.gptq-llama-triton.yaml up
```
Or run the container directly:

```
docker run --gpus all -v $PWD/models/:/models:rw -v $PWD/config.yaml:/llm-api/config.yaml:ro -p 8000:8000 1b5d/llm-api:0.0.4-gptq-llama-triton
```
Note: the `llm-api:0.0.x-gptq-llama-cuda` image has been deprecated; please switch to the Triton image, as it seems more reliable.
Example config file:
```yaml
models_dir: /models
model_family: gptq_llama
setup_params:
  repo_id: user/repo_id
  filename: <model.safetensors or model.pt>
model_params:
  group_size: 128
  wbits: 4
  cuda_visible_devices: "0"
  device: "cuda:0"
  st_device: 0
```
Note: `st_device` is only needed for safetensors models; otherwise you can either remove it or set it to `-1`.
Example request:
`POST /generate`

```
curl --location 'localhost:8000/generate' \
--header 'Content-Type: application/json' \
--data '{
    "prompt": "What is the capital of France?",
    "params": {
        "temp": 0.8,
        "top_p": 0.95,
        "min_length": 10,
        "max_length": 50
    }
}'
```
## Credits
Credits go to:

- llama.cpp for making it possible to run Llama and Alpaca models on CPU.
- serge for providing an example of how to build an API using FastAPI.
- llama-cpp-python for the Python bindings for llama.cpp.
- GPTQ-for-LLaMa for providing a GPTQ implementation for Llama-based models.