Warning
ScaleLLM is currently in the active development stage and may not yet provide the optimal level of inference efficiency. We are fully dedicated to continuously enhancing its efficiency while also adding more features.
In the coming weeks, we have exciting plans to focus on speculative decoding and stateful conversation, alongside further kernel optimizations. We appreciate your understanding and look forward to delivering an even better solution.
- [11/2023] - First official release with support for popular open-source models.
- Overview
- Supported Models
- Get Started
- Usage Examples
- Quantization
- Limitations
- Contributing
- Acknowledgements
- License
ScaleLLM is a cutting-edge inference system engineered for large language models (LLMs), meticulously designed to meet the demands of production environments. It extends its support to a wide range of popular open-source models, including Llama2, Bloom, GPT-NeoX, and more.
- High Efficiency: Excels in high-performance LLM inference, leveraging state-of-the-art techniques and technologies like Flash Attention, Paged Attention, Continuous batching, and more.
- Tensor Parallelism: Utilizes tensor parallelism for efficient model execution.
- OpenAI-compatible API: An efficient Golang REST API server that is compatible with the OpenAI API.
- Huggingface models: Seamless integration with most popular HF models, supporting safetensors.
- Customizable: Offers flexibility for customization to meet your specific needs, and provides an easy way to add new models.
- Production Ready: Engineered with production environments in mind, ScaleLLM is equipped with robust system monitoring and management features to ensure a seamless deployment experience.
Please note that to use Yi models, you need to add --model_type=Yi to the command line. For example:
docker run -it --gpus=all --net=host --shm-size=1g \
-v $HOME/.cache/huggingface/hub:/models \
-e HF_MODEL_ID=01-ai/Yi-34B-Chat-4bits \
-e DEVICE=auto \
docker.io/vectorchai/scalellm:latest --logtostderr --model_type=Yi
Models | Tensor Parallel | Quantization | Chat API | HF models examples |
---|---|---|---|---|
Yi | Yes | Yes | Yes | 01-ai/Yi-6B, 01-ai/Yi-34B-Chat-4bits, 01-ai/Yi-6B-200K |
Llama2 | Yes | Yes | Yes | meta-llama/Llama-2-7b, TheBloke/Llama-2-13B-chat-GPTQ, TheBloke/Llama-2-70B-AWQ |
Aquila | Yes | Yes | Yes | BAAI/Aquila-7B, BAAI/AquilaChat-7B |
Bloom | Yes | Yes | No | bigscience/bloom |
GPT_j | Yes | Yes | No | EleutherAI/gpt-j-6b |
GPT_NeoX | Yes | Yes | No | EleutherAI/gpt-neox-20b |
GPT2 | Yes | Yes | No | gpt2 |
InternLM | Yes | Yes | Yes | internlm/internlm-7b |
Mistral | Yes | Yes | Yes | mistralai/Mistral-7B-v0.1 |
MPT | Yes | Yes | Yes | mosaicml/mpt-30b |
If your model is not included in the supported list, we are more than willing to assist you. Please feel free to create a request for adding a new model on GitHub Issues.
The easiest way to get started with our project is by using the official Docker images. If you don't have Docker installed, please follow the installation instructions for your platform.
You can download and install Docker from the official website: Docker Installation.
Note
To use GPUs, you also need to install the NVIDIA Container Toolkit.
Once you have Docker installed, you can run the ScaleLLM Docker container using the following command:
docker run -it --gpus=all --net=host --shm-size=1g \
-v $HOME/.cache/huggingface/hub:/models \
-e HF_MODEL_ID=TheBloke/Llama-2-7B-chat-AWQ \
-e DEVICE=cuda:0 \
docker.io/vectorchai/scalellm:latest --logtostderr
This command starts the Docker container with GPU support and various configuration options.
Warning
NCCL might fall back to using the host memory if NVLink or PCI is not available. To allow NCCL to use the host memory, we added '--shm-size=1g' to the docker run command.
- HF_MODEL_ID specifies which Hugging Face model you want to run.
- HF_MODEL_REVISION specifies which Hugging Face model revision you want to run. By default, it is set to "main".
- HF_MODEL_ALLOW_PATTERN specifies which types of files are allowed to be downloaded. By default, it is set to "*.json,*.safetensors,*.model".
- DEVICE specifies the device on which this model should run. By default, it is set to "auto".
- HUGGING_FACE_HUB_TOKEN specifies the Hugging Face token used for gated models.
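As a rough illustration of how these variables fit together, the sketch below assembles the `docker run` command shown above from Python. The helper name `build_docker_command` and its parameters are hypothetical and not part of ScaleLLM; only the flags, variable names, defaults, and image name come from this document. Passing the defaults explicitly is redundant but makes the effective configuration visible.

```python
import os
import shlex

IMAGE = "docker.io/vectorchai/scalellm:latest"

# Defaults as documented above. HF_MODEL_ID has no default and must be given.
DEFAULTS = {
    "HF_MODEL_REVISION": "main",
    "HF_MODEL_ALLOW_PATTERN": "*.json,*.safetensors,*.model",
    "DEVICE": "auto",
}


def build_docker_command(model_id, hub_token=None, **overrides):
    """Return the `docker run` argument list (hypothetical helper, not a ScaleLLM API)."""
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        raise ValueError(f"unknown settings: {sorted(unknown)}")
    env = {"HF_MODEL_ID": model_id, **DEFAULTS, **overrides}
    if hub_token:  # only needed for gated models
        env["HUGGING_FACE_HUB_TOKEN"] = hub_token
    cache_dir = os.path.expanduser("~/.cache/huggingface/hub")
    cmd = ["docker", "run", "-it", "--gpus=all", "--net=host", "--shm-size=1g",
           "-v", f"{cache_dir}:/models"]
    for name, value in env.items():
        cmd += ["-e", f"{name}={value}"]
    return cmd + [IMAGE, "--logtostderr"]


if __name__ == "__main__":
    # Print a copy-pastable command line instead of executing it.
    print(shlex.join(build_docker_command("TheBloke/Llama-2-7B-chat-AWQ", DEVICE="cuda:0")))
```

The returned list can be passed to `subprocess.run` directly, which avoids shell quoting problems with the `*` characters in the allow pattern.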
Note
Although ScaleLLM supports both CPU and GPU, we recommend using GPU for better performance. CPU support is mainly for debugging and testing purposes, so the performance might be sub-optimal. If you want to use CPU, please set DEVICE=cpu in the command.
After running the Docker container, two ports are exposed:
- Port 8888 for gRPC Server: The gRPC server is served on 0.0.0.0:8888 by default. You can use gRPC to interact with the service.
- Port 9999 for HTTP Server: The simple HTTP server for instrumentation is served on 0.0.0.0:9999 by default. This server provides various endpoints for managing and monitoring the service:
  - Use curl localhost:9999/health to check the health status of the service.
  - Use curl localhost:9999/metrics to export Prometheus metrics.
  - Use curl localhost:9999/gflags to list all available gflags for configuration.
  - More to come...
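The monitoring endpoints can also be polled from a script. The following is a minimal sketch using only the Python standard library. It assumes that `/health` answers with HTTP 200 when the service is healthy and that `/metrics` returns the Prometheus text format without timestamps; neither detail is confirmed by this document. The function names and the metric names in the usage note are hypothetical.

```python
import urllib.request

BASE_URL = "http://localhost:9999"  # HTTP server address described above


def is_healthy(base_url=BASE_URL, timeout=2.0):
    """Return True if GET /health answers with HTTP 200 (assumed success code)."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as resp:
            return resp.status == 200
    except OSError:  # connection refused, timeout, or HTTP error status
        return False


def parse_metrics(text):
    """Parse Prometheus text-format samples into {series: value}.

    Comment lines (# HELP, # TYPE) are skipped. Assumes samples carry no
    trailing timestamp, so the value is the last space-separated field.
    """
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        series, _, value = line.rpartition(" ")
        samples[series] = float(value)
    return samples


def fetch_metrics(base_url=BASE_URL, timeout=2.0):
    """Download and parse the /metrics endpoint."""
    with urllib.request.urlopen(f"{base_url}/metrics", timeout=timeout) as resp:
        return parse_metrics(resp.read().decode("utf-8"))
```

For example, `parse_metrics('requests_total{kind="chat"} 3')` yields a dictionary with the single key `requests_total{kind="chat"}`; the metric name here is made up for illustration.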
You can also start a REST API gateway using the following command:
docker run -it --net=host \
docker.io/vectorchai/scalellm-gateway:latest --logtostderr
The REST API Server is available on localhost:8080. You can use REST API requests to interact with the system. Check out the Usage Examples section for more details.
A local Chatbot UI is also available on localhost:3000. You can start it with the following command:
docker run -it --net=host \
-e OPENAI_API_HOST=http://127.0.0.1:8080 \
-e OPENAI_API_KEY=YOUR_API_KEY \
docker.io/vectorchai/chatbot-ui:latest
Using Docker Compose is the easiest way to run ScaleLLM with all the services together. If you don't have Docker Compose installed, please follow the installation doc for your platform.
curl https://raw.githubusercontent.com/vectorch-ai/ScaleLLM/main/scalellm.yml -sSf > scalellm_compose.yml
HF_MODEL_ID=TheBloke/Llama-2-7B-chat-AWQ DEVICE=cuda docker compose -f ./scalellm_compose.yml up
You will get the following running services:
- Chatbot UI on port 3000: localhost:3000
- ScaleLLM gRPC server on port 8888: localhost:8888
- ScaleLLM HTTP server for monitoring on port 9999: localhost:9999
- ScaleLLM REST API server on port 8080: localhost:8080
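To confirm that all four services came up, a quick TCP reachability check may be enough. The sketch below only tests whether each port accepts connections; it does not verify that the service behind it is healthy. The helper names are hypothetical, and the ports are the ones listed above.

```python
import socket

# Service name -> port, as listed above.
SERVICES = {
    "Chatbot UI": 3000,
    "ScaleLLM gRPC server": 8888,
    "ScaleLLM HTTP monitoring server": 9999,
    "ScaleLLM REST API server": 8080,
}


def port_open(port, host="localhost", timeout=1.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, unreachable, or timed out
        return False


def check_services(services=SERVICES, host="localhost"):
    """Return {service name: bool} for every configured port."""
    return {name: port_open(port, host) for name, port in services.items()}


if __name__ == "__main__":
    for name, up in check_services().items():
        print(f"{name}: {'up' if up else 'DOWN'}")
```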
You can get chat completions with the following example:
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "TheBloke/Llama-2-7B-chat-AWQ",
"messages": [
{
"role": "system",
"content": "You are a helpful assistant."
},
{
"role": "user",
"content": "Hello!"
}
]
}'
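import os

import openai

# This example uses the module-level API of the pre-1.0 `openai` Python package.
# The client refuses to send requests without a key, so a placeholder is used here;
# this assumes the gateway does not validate the key.
openai.api_key = os.environ.get("OPENAI_API_KEY", "EMPTY")
openai.api_base = "http://localhost:8080/v1"

# List available models
print("==== Available models ====")
models = openai.Model.list()
print(models)

model = "TheBloke/Llama-2-7B-chat-AWQ"
completion = openai.ChatCompletion.create(
    model=model,
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello"},
    ],
    max_tokens=256,
    stream=True,
)

print(f"==== Model: {model} ====")
# Print each streamed fragment as soon as it arrives.
for chunk in completion:
    content = chunk["choices"][0]["delta"].get("content")
    if content:
        print(content, end="", flush=True)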
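If you prefer not to depend on the `openai` package, the streamed response can be consumed with the standard library alone. The sketch below assumes the gateway frames streamed chunks the way the OpenAI API does, as server-sent events of the form `data: {json}` terminated by `data: [DONE]`; this document only states that the API is OpenAI-compatible, so treat the framing as an assumption. The function names are hypothetical.

```python
import json
import urllib.request


def iter_stream_text(lines):
    """Yield text fragments from an iterable of server-sent-event lines."""
    for raw in lines:
        line = raw.decode("utf-8") if isinstance(raw, bytes) else raw
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and other SSE fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":  # end-of-stream sentinel (assumed, OpenAI style)
            break
        chunk = json.loads(payload)
        content = chunk["choices"][0]["delta"].get("content")
        if content:
            yield content


def stream_chat(messages, model, url="http://localhost:8080/v1/chat/completions"):
    """POST a streaming chat request and yield text fragments as they arrive."""
    body = json.dumps({"model": model, "messages": messages, "stream": True})
    request = urllib.request.Request(
        url,
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        # The HTTP response object is iterable line by line.
        yield from iter_stream_text(response)


if __name__ == "__main__":
    messages = [{"role": "user", "content": "Hello!"}]
    for text in stream_chat(messages, "TheBloke/Llama-2-7B-chat-AWQ"):
        print(text, end="", flush=True)
```

Keeping the parsing in `iter_stream_text` separate from the HTTP call makes it possible to test the parser against recorded lines without a running server.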
For regular completions, you can use this example:
curl http://localhost:8080/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "TheBloke/Llama-2-7B-chat-AWQ",
"prompt": "hello",
"max_tokens": 32,
"temperature": 0.7,
"stream": true
}'
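import os

import openai

# This example uses the module-level API of the pre-1.0 `openai` Python package.
# The client refuses to send requests without a key, so a placeholder is used here;
# this assumes the gateway does not validate the key.
openai.api_key = os.environ.get("OPENAI_API_KEY", "EMPTY")
openai.api_base = "http://localhost:8080/v1"

# List available models
print("==== Available models ====")
models = openai.Model.list()
print(models)

model = "TheBloke/Llama-2-7B-chat-AWQ"
completion = openai.Completion.create(
    model=model,
    prompt="hello",
    max_tokens=256,
    temperature=0.7,
    stream=True,
)

print(f"==== Model: {model} ====")
# Print each streamed fragment as soon as it arrives.
for chunk in completion:
    content = chunk["choices"][0].get("text")
    if content:
        print(content, end="", flush=True)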
Quantization is a crucial process for reducing the memory footprint of models. ScaleLLM supports two quantization techniques: Accurate Post-Training Quantization (GPTQ) and Activation-aware Weight Quantization (AWQ), with seamless integration into the following libraries: autogptq, exllama, exllamav2, and awq.
By default, exllamav2 is employed for GPTQ 4-bit quantization. However, you can choose a specific implementation with the "--qlinear_gptq_impl" option, which accepts exllama, exllamav2, or auto.
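To get a feel for the memory savings, the weight footprint can be estimated as parameter count times bits per weight. The sketch below is a back-of-the-envelope calculation, not a measurement of ScaleLLM: it assumes a nominal 7 billion parameters and ignores quantization metadata (scales and zero points), activations, and the KV cache, so real usage will be higher.

```python
def weight_memory_gib(num_params, bits_per_weight):
    """Approximate weight storage in GiB: params * bits / 8 bytes."""
    return num_params * bits_per_weight / 8 / 2**30


NUM_PARAMS = 7e9  # nominal size of a "7B" model (assumption for illustration)

fp16 = weight_memory_gib(NUM_PARAMS, 16)  # half-precision weights
int4 = weight_memory_gib(NUM_PARAMS, 4)   # 4-bit GPTQ or AWQ weights

print(f"16-bit weights: {fp16:.1f} GiB")
print(f" 4-bit weights: {int4:.1f} GiB")
print(f"reduction: {fp16 / int4:.0f}x")
```

Under these assumptions the weights shrink from about 13.0 GiB to about 3.3 GiB, a 4x reduction.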
There are several known limitations we are looking to address in the coming months, including:
- Only supports Hugging Face models with fast tokenizers.
- Only supports GPUs that are newer than the Turing architecture.
If you have any questions or want to contribute, please don't hesitate to ask in our "Discussions" forum or join our "Discord" chat room. We welcome your input and contributions to make ScaleLLM even better. Please follow the Contributing.md to get started.
The following open-source projects have been used in this project, either in their original form or modified to meet our needs:
- pytorch
- FasterTransformer
- vllm
- AutoGPTQ
- llm-awq
- flash-attn
- exllama
- tokenizers
- safetensors
- sentencepiece
- grpc-gateway
- chatbot-ui
This project is released under the Apache 2.0 license.