🚀 This serverless worker utilizes vLLM behind the scenes and is integrated into RunPod's serverless environment. It supports dynamic auto-scaling using the built-in RunPod autoscaling feature.
We now offer a pre-built Docker image for the vLLM Worker that you can configure entirely with environment variables when creating the RunPod Serverless Endpoint: `runpod/worker-vllm:dev`
- Required:
  - `MODEL_NAME`: Hugging Face model repository (e.g., `openchat/openchat-3.5-1210`).
- Optional:
  - `MODEL_BASE_PATH`: Model storage directory (default: `/runpod-volume`).
  - `HF_TOKEN`: Hugging Face token for private and gated models (e.g., Llama, Falcon).
  - `NUM_GPU_SHARD`: Number of GPUs to split the model across (default: `1`).
  - `QUANTIZATION`: AWQ (`awq`) or SqueezeLLM (`squeezellm`) quantization.
  - `MAX_CONCURRENCY`: Maximum number of concurrent requests (default: `100`).
  - `DEFAULT_BATCH_SIZE`: Token streaming batch size (default: `30`). Batching reduces the number of HTTP calls, making streaming 8-10x faster than unbatched streaming and on par with non-streaming performance.
  - `DISABLE_LOG_STATS`: Set to `0` to enable vLLM stats logging or `1` to disable it.
  - `DISABLE_LOG_REQUESTS`: Set to `0` to enable request logging or `1` to disable it.
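For reference, the snippet below sketches how the optional variables and their documented defaults map to settings. This is purely illustrative and is not the worker's actual configuration code; the variable names and defaults are taken from the list above.

```python
import os

# Illustrative only: how the environment variables above map to settings.
model_name = os.environ["MODEL_NAME"]                            # required
model_base_path = os.getenv("MODEL_BASE_PATH", "/runpod-volume")
hf_token = os.getenv("HF_TOKEN")                                 # only needed for private/gated models
num_gpu_shard = int(os.getenv("NUM_GPU_SHARD", "1"))
quantization = os.getenv("QUANTIZATION")                         # "awq", "squeezellm", or unset
max_concurrency = int(os.getenv("MAX_CONCURRENCY", "100"))
default_batch_size = int(os.getenv("DEFAULT_BATCH_SIZE", "30"))
disable_log_stats = os.getenv("DISABLE_LOG_STATS") == "1"        # 0 enables, 1 disables logging
disable_log_requests = os.getenv("DISABLE_LOG_REQUESTS") == "1"  # 0 enables, 1 disables logging
```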
To build an image with the model baked in, specify the following Docker build arguments:

- Required:
  - `MODEL_NAME`
- Optional:
  - `MODEL_BASE_PATH`: Defaults to `/runpod-volume` for network storage. Use `/models` for local container storage.
  - `QUANTIZATION`
  - `HF_TOKEN`
  - `WORKER_CUDA_VERSION`: `11.8` or `12.1` (default: `11.8`, since a small number of workers do not yet support CUDA 12.1; `12.1` is recommended for optimal performance).
```bash
sudo docker build -t username/image:tag --build-arg MODEL_NAME="openchat/openchat_3.5" --build-arg MODEL_BASE_PATH="/models" .
```
- LLaMA & LLaMA-2 (
meta-llama/Llama-2-70b-hf
,lmsys/vicuna-13b-v1.3
,young-geng/koala
,openlm-research/open_llama_13b
, etc.) - Mistral (
mistralai/Mistral-7B-v0.1
,mistralai/Mistral-7B-Instruct-v0.1
, etc.) - Mixtral (
mistralai/Mixtral-8x7B-v0.1
,mistralai/Mixtral-8x7B-Instruct-v0.1
, etc.) - Aquila & Aquila2 (
BAAI/AquilaChat2-7B
,BAAI/AquilaChat2-34B
,BAAI/Aquila-7B
,BAAI/AquilaChat-7B
, etc.) - Baichuan & Baichuan2 (
baichuan-inc/Baichuan2-13B-Chat
,baichuan-inc/Baichuan-7B
, etc.) - BLOOM (
bigscience/bloom
,bigscience/bloomz
, etc.) - ChatGLM (
THUDM/chatglm2-6b
,THUDM/chatglm3-6b
, etc.) - Falcon (
tiiuae/falcon-7b
,tiiuae/falcon-40b
,tiiuae/falcon-rw-7b
, etc.) - GPT-2 (
gpt2
,gpt2-xl
, etc.) - GPT BigCode (
bigcode/starcoder
,bigcode/gpt_bigcode-santacoder
, etc.) - GPT-J (
EleutherAI/gpt-j-6b
,nomic-ai/gpt4all-j
, etc.) - GPT-NeoX (
EleutherAI/gpt-neox-20b
,databricks/dolly-v2-12b
,stabilityai/stablelm-tuned-alpha-7b
, etc.) - InternLM (
internlm/internlm-7b
,internlm/internlm-chat-7b
, etc.) - MPT (
mosaicml/mpt-7b
,mosaicml/mpt-30b
, etc.) - OPT (
facebook/opt-66b
,facebook/opt-iml-max-30b
, etc.) - Phi (
microsoft/phi-1_5
,microsoft/phi-2
, etc.) - Qwen (
Qwen/Qwen-7B
,Qwen/Qwen-7B-Chat
, etc.) - Yi (
01-ai/Yi-6B
,01-ai/Yi-34B
, etc.)
And any other models supported by vLLM 0.2.6.
Ensure that you have Docker installed and properly set up before running the docker build commands. Once built, you can deploy this serverless worker in your desired environment with confidence that it will automatically scale based on demand. For further inquiries or assistance, feel free to contact our support team.
You may either use a `prompt` or a list of `messages` as input. If you use `messages`, the model's chat template will be applied to the messages automatically, so the model must have one. If you use `prompt`, you may optionally apply the model's chat template to the prompt by setting `apply_chat_template` to `true`.
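As a minimal sketch, here is how such an input could be sent to a deployed endpoint from Python using RunPod's standard serverless HTTP API. The endpoint ID, API key, and the `/runsync` route shown are placeholders and assumptions, not values from this worker; adapt them to your deployment.

```python
import requests

ENDPOINT_ID = "your-endpoint-id"   # placeholder
API_KEY = "your-runpod-api-key"    # placeholder

# Assumes RunPod's standard synchronous /runsync route for serverless endpoints.
response = requests.post(
    f"https://api.runpod.ai/v2/{ENDPOINT_ID}/runsync",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={
        "input": {
            "prompt": "Explain what a serverless GPU worker is in one sentence.",
            "apply_chat_template": True,
            "sampling_params": {"temperature": 0.7, "max_tokens": 128},
        }
    },
    timeout=300,
)
print(response.json())
```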
Argument | Type | Default | Description |
---|---|---|---|
`prompt` | `str` | | Prompt string to generate text from. |
`messages` | `list[dict[str, str]]` | | List of messages, which will automatically have the model's chat template applied. Overrides `prompt`. |
`apply_chat_template` | `bool` | `False` | Whether to apply the model's chat template to the `prompt`. |
`sampling_params` | `dict` | `{}` | Sampling parameters to control the generation, such as temperature, top_p, etc. |
`stream` | `bool` | `False` | Whether to enable streaming of the output. If `True`, responses are streamed as they are generated. |
`batch_size` | `int` | `DEFAULT_BATCH_SIZE` | The number of tokens to stream per HTTP POST call. |
Your list can contain any number of messages, and each message can have any of the following roles:

- `user`
- `assistant`
- `system`

The model's chat template will be applied to the messages automatically.
Example:

```json
[
  {
    "role": "system",
    "content": "..."
  },
  {
    "role": "user",
    "content": "..."
  },
  {
    "role": "assistant",
    "content": "..."
  }
]
```
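Putting the pieces together, a complete input payload that combines `messages`, `sampling_params`, and streaming might look like the following sketch. The field names come from the input table above; the values are purely illustrative.

```python
# Illustrative input payload for the worker; field names follow the input table above.
job_input = {
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Summarize what vLLM does."},
    ],
    "sampling_params": {"temperature": 0.8, "top_p": 0.95, "max_tokens": 256},
    "stream": True,    # stream tokens back as they are generated
    "batch_size": 30,  # tokens per streamed HTTP POST (see DEFAULT_BATCH_SIZE)
}
```

The `sampling_params` dictionary accepts the arguments listed in the table below.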
Argument | Type | Default | Description |
---|---|---|---|
`best_of` | `Optional[int]` | `None` | Number of output sequences generated from the prompt; the top `n` sequences are returned from these `best_of` sequences. Must be ≥ `n`. Treated as beam width in beam search. Defaults to `n`. |
`presence_penalty` | `float` | `0.0` | Penalizes new tokens based on their presence in the generated text so far. Values > 0 encourage new tokens, values < 0 encourage repetition. |
`frequency_penalty` | `float` | `0.0` | Penalizes new tokens based on their frequency in the generated text so far. Values > 0 encourage new tokens, values < 0 encourage repetition. |
`repetition_penalty` | `float` | `1.0` | Penalizes new tokens based on their appearance in the prompt and generated text. Values > 1 encourage new tokens, values < 1 encourage repetition. |
`temperature` | `float` | `1.0` | Controls the randomness of sampling. Lower values make it more deterministic, higher values make it more random. Zero means greedy sampling. |
`top_p` | `float` | `1.0` | Controls the cumulative probability of top tokens to consider. Must be in (0, 1]. Set to 1 to consider all tokens. |
`top_k` | `int` | `-1` | Controls the number of top tokens to consider. Set to -1 to consider all tokens. |
`min_p` | `float` | `0.0` | Minimum probability for a token to be considered, relative to the most likely token. Must be in [0, 1]. Set to 0 to disable. |
`use_beam_search` | `bool` | `False` | Whether to use beam search instead of sampling. |
`length_penalty` | `float` | `1.0` | Penalizes sequences based on their length. Used in beam search. |
`early_stopping` | `Union[bool, str]` | `False` | Controls the stopping condition in beam search. Can be `True`, `False`, or `"never"`. |
`stop` | `Union[None, str, List[str]]` | `None` | String(s) that stop generation when produced. The output will not contain these strings. |
`stop_token_ids` | `Optional[List[int]]` | `None` | List of token IDs that stop generation when produced. The output contains these tokens unless they are special tokens. |
`ignore_eos` | `bool` | `False` | Whether to ignore the end-of-sequence token and continue generating tokens after it is produced. |
`max_tokens` | `int` | `16` | Maximum number of tokens to generate per output sequence. |
`skip_special_tokens` | `bool` | `True` | Whether to skip special tokens in the output. |
`spaces_between_special_tokens` | `bool` | `True` | Whether to add spaces between special tokens in the output. |
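For example, a `sampling_params` dictionary that uses several of these options might look like the sketch below; the keys follow the table above and the values are purely illustrative.

```python
# Illustrative sampling_params; keys follow the table above.
sampling_params = {
    "temperature": 0.7,         # lower = more deterministic
    "top_p": 0.9,               # consider tokens within the top 90% cumulative probability
    "top_k": 40,                # consider only the 40 most likely tokens
    "repetition_penalty": 1.1,  # discourage repeating prompt/output tokens
    "max_tokens": 512,          # default is only 16, so set this explicitly for longer outputs
    "stop": ["</s>"],           # example stop string
}
```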