/runpod-worker-vllm

Runpod vLLM Worker that Works!


vLLM Worker on Runpod Serverless

This repo consists of worker code that you can build into a Docker image and run on Runpod Serverless. It uses vLLM under the hood to run inference on a given model. It supports a wide range of LLMs, including Llama2, Mistral, Falcon, StarCoder, BLOOM, and many more! (Check out all supported models here)


🌟 How to use

  1. Clone this repository.
  2. Build the Docker image.
  3. Push the Docker image to your Docker registry.
  4. Deploy to Runpod Serverless.

🏗️ Build Docker image (Optional)

docker build -t <your docker registry>/<your docker image name>:<your docker image tag> .

Push the image to your Docker registry with the following command:

docker push <your docker registry>/<your docker image name>:<your docker image tag>

Alternatively, skip the build and use the prebuilt lobstrate/runpod-worker-vllm image from Docker Hub.

🚀 Deploy to Runpod Serverless

Once the image is in your Docker registry, you can deploy it to Runpod Serverless. This step-by-step guide sets up a network volume to hold the model. When the endpoint receives its first request, the worker downloads the model from the Hugging Face Hub into the network volume; subsequent requests load the model from the volume. (Even if the workers are scaled down to 0, the model persists in the network volume.)
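The download-once behaviour described above can be sketched as follows. This is a minimal illustration, not the worker's actual code: the function name `ensure_model`, the directory layout, the `.complete` marker file, and the injected `download` callable are all hypothetical.

```python
from pathlib import Path
from typing import Callable


def ensure_model(model_name: str, volume_dir: str,
                 download: Callable[[str, Path], None]) -> Path:
    """Return the model directory on the volume, downloading only if missing."""
    # Hypothetical layout: one sub-directory per model inside the volume.
    target = Path(volume_dir) / model_name.replace("/", "--")
    marker = target / ".complete"  # hypothetical "download finished" flag
    if not marker.exists():
        target.mkdir(parents=True, exist_ok=True)
        download(model_name, target)  # first request: fetch from the hub
        marker.touch()                # later requests skip the download
    return target
```

Because the marker lives on the network volume rather than on the container disk, a freshly started worker would find it and reuse the files already downloaded.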

1. Create a new Network Volume

You need a network volume into which the worker can download your model from the Hugging Face Hub. You can create one from the Runpod UI.


  1. Click Storage in the Runpod sidebar, under the Serverless tab.
  2. Click the + Network Volume button.
  3. Select the datacenter region closest to your users.
  4. Give your network volume a name.
  5. Select a size for your network volume.
  6. Click the Create button.

Note: To get a rough estimate of how much storage you need, look up your model on https://huggingface.co. Open the Files and versions tab and add up the sizes of all the files.
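As a rough sketch of that estimate, the helper below sums the file sizes and rounds up with some spare room. The function name, the 20% headroom, and the example sizes are assumptions for illustration only; they are not taken from any real model.

```python
import math


def required_volume_gb(file_sizes_gb, headroom=0.2):
    """Sum the listed file sizes and add headroom (20% is an assumed default)."""
    total = sum(file_sizes_gb)
    return math.ceil(total * (1 + headroom))


# Hypothetical sizes copied by hand from a model's file listing, in GB.
print(required_volume_gb([9.98, 3.5]))
```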

2. Create a new Template

After creating a network volume, create a template for your worker to use:


  1. Click Custom Templates in the Runpod sidebar, under the Serverless tab.
  2. Click the New Template button.
  3. Give your template a name.
  4. Enter your Docker image name in the Container Image field. This is the image you pushed to your Docker registry earlier. (Or enter the lobstrate/runpod-worker-vllm:latest image from Docker Hub.)
  5. Select a container disk size. (This doesn't matter much, since the model is stored on the network volume.)
  6. [IMPORTANT] Enter the environment variables for your model. MODEL_NAME is required; it is used to download your model from the Hugging Face Hub. (See the Environment Variables section for more details.)

3. Create a new Endpoint

After creating a template, create an endpoint that uses it:


  1. Click Endpoints in the Runpod sidebar, under the Serverless tab.
  2. Click the New Endpoint button.
  3. Give your endpoint a name.
  4. Select the template you created in the previous step.
  5. Select a GPU type. See the GPU Type Guide section to choose the right GPU type for your model.
  6. Enter the active and max worker counts.
  7. Check the fast boot option.
  8. Select the network volume you created earlier.
  9. Click the Create button.

4. Test your endpoint

After creating an endpoint, you can test it from inside the Runpod UI:


  1. Click the Requests tab on your endpoint page.
  2. Click the Run button.

You can also modify the request body. See the Request Body section for more details.

📦 Request Body

This is the request body you can send to your endpoint:

{
  "input": {
    "prompt": "Say, Hello World!",
    "max_tokens": 50,
    // other params...
  } 
}
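For sending this body from code rather than from the UI, here is a minimal sketch using only the Python standard library. The URL pattern `https://api.runpod.ai/v2/<endpoint id>/runsync` and the Bearer-token header are assumptions based on Runpod's public serverless API, not something this README states; the function name `build_request` and the endpoint ID and API key values are placeholders.

```python
import json
import urllib.request


def build_request(endpoint_id, api_key, prompt, max_tokens=50, **params):
    """Build (but do not send) a POST request carrying the body shown above."""
    payload = {"input": {"prompt": prompt, "max_tokens": max_tokens, **params}}
    # Assumed URL pattern for a synchronous run on a serverless endpoint.
    url = f"https://api.runpod.ai/v2/{endpoint_id}/runsync"
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


# Sending it (requires a real endpoint ID and API key):
# req = build_request("<your endpoint id>", "<your api key>", "Say, Hello World!")
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp))
```

Keeping request construction separate from sending makes the payload easy to inspect before spending GPU time on it.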

All the params you can send to your endpoint are listed here:

  1. prompt: The prompt you want to generate from.
  2. max_tokens: Maximum number of tokens to generate per output sequence.
  3. n: Number of output sequences to return for the given prompt.
  4. best_of: Number of output sequences that are generated from the prompt. From these best_of sequences, the top n sequences are returned. best_of must be greater than or equal to n. This is treated as the beam width when use_beam_search is True. By default, best_of is set to n.
  5. presence_penalty: Float that penalizes new tokens based on whether they appear in the generated text so far. Values > 0 encourage the model to use new tokens, while values < 0 encourage the model to repeat tokens.
  6. frequency_penalty: Float that penalizes new tokens based on their frequency in the generated text so far. Values > 0 encourage the model to use new tokens, while values < 0 encourage the model to repeat tokens.
  7. repetition_penalty: Float that penalizes new tokens based on whether they appear in the generated text so far. Values > 1 encourage the model to use new tokens, while values < 1 encourage the model to repeat tokens.
  8. temperature: Float that controls the randomness of the sampling. Lower values make the model more deterministic, while higher values make the model more random. Zero means greedy sampling.
  9. top_p: Float that controls the cumulative probability of the top tokens to consider. Must be in (0, 1]. Set to 1 to consider all tokens.
  10. top_k: Integer that controls the number of top tokens to consider. Set to -1 to consider all tokens.
  11. use_beam_search: Whether to use beam search instead of sampling.
  12. length_penalty: Float that penalizes sequences based on their length. Used in beam search.
  13. early_stopping: Controls the stopping condition for beam search. It accepts the following values: True, where the generation stops as soon as there are best_of complete candidates; False, where a heuristic is applied and the generation stops when it is very unlikely to find better candidates; "never", where the beam search procedure only stops when there cannot be better candidates (canonical beam search algorithm).
  14. stop: List of strings that stop the generation when they are generated. The returned output will not contain the stop strings.
  15. stop_token_ids: List of token IDs that stop the generation when they are generated. The returned output will contain the stop tokens unless the stop tokens are special tokens.
  16. ignore_eos: Whether to ignore the EOS token and continue generating tokens after the EOS token is generated.
  17. logprobs: Number of log probabilities to return per output token. Note that the implementation follows the OpenAI API: the result includes the log probabilities of the logprobs most likely tokens, as well as the chosen tokens. The API will always return the log probability of the sampled token, so there may be up to logprobs+1 elements in the response.
  18. prompt_logprobs: Number of log probabilities to return per prompt token.
  19. skip_special_tokens: Whether to skip special tokens in the output.
  20. spaces_between_special_tokens: Whether to add spaces between special tokens in the output. Defaults to True.
  21. logits_processors: List of functions that modify logits based on previously generated tokens.
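Several of the parameters above carry constraints (best_of >= n, top_p in (0, 1], top_k of -1 for all tokens). The sketch below checks them on the client side before a request is sent. It is illustrative only: the function name is hypothetical, the default of n = 1 is an assumption, and the rules that top_k must otherwise be at least 1 and that temperature must not be negative are assumptions not stated in the list.

```python
def validate_sampling_params(params: dict) -> list:
    """Return a list of human-readable problems; an empty list means OK."""
    errors = []
    n = params.get("n", 1)               # assumed default
    best_of = params.get("best_of", n)   # best_of defaults to n
    if best_of < n:
        errors.append("best_of must be >= n")
    top_p = params.get("top_p", 1.0)
    if not (0 < top_p <= 1):
        errors.append("top_p must be in (0, 1]")
    top_k = params.get("top_k", -1)
    if top_k != -1 and top_k < 1:        # assumed rule: -1 or a positive count
        errors.append("top_k must be -1 or >= 1")
    if params.get("temperature", 1.0) < 0:   # assumed rule
        errors.append("temperature must be >= 0")
    return errors


print(validate_sampling_params({"prompt": "Hi", "n": 2, "best_of": 1}))
```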

🔗 Environment Variables

These are the environment variables you can define on your Runpod template:

| Key | Value | Optional |
| --- | --- | --- |
| MODEL_NAME | your model name | false |
| HF_HOME | /runpod-volume | true |
| HUGGING_FACE_HUB_TOKEN | your huggingface token | true |
| MODEL_REVISION | your model revision | true |
| MODEL_BASE_PATH | your model base path | true |
| TOKENIZER | your tokenizer | true |
Note: You can get your Hugging Face token from https://huggingface.co/settings/tokens
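To make the required/optional split concrete, here is a minimal sketch of how a worker might read these variables. It is not the repository's actual code: the function name and the returned keys are hypothetical, and treating /runpod-volume as the fallback for HF_HOME is an assumption drawn from the value listed above.

```python
import os


def load_settings(env=os.environ):
    """Collect the variables listed above; only MODEL_NAME is mandatory."""
    model_name = env.get("MODEL_NAME")
    if not model_name:
        raise ValueError("MODEL_NAME is required")
    return {
        "model_name": model_name,
        "hf_home": env.get("HF_HOME", "/runpod-volume"),  # assumed default
        "hf_token": env.get("HUGGING_FACE_HUB_TOKEN"),    # None if unset
        "revision": env.get("MODEL_REVISION"),
        "base_path": env.get("MODEL_BASE_PATH"),
        "tokenizer": env.get("TOKENIZER"),
    }
```

Passing a plain dict instead of `os.environ` makes the function easy to try locally, for example `load_settings({"MODEL_NAME": "<your model name>"})`.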

🚀 GPU Type Guide

Here is a rough estimate on how much VRAM you need for your model. You can use this table to select the right GPU type for your model.

| Model Parameters | Storage & VRAM |
| --- | --- |
| 7B | 6GB |
| 13B | 9GB |
| 33B | 19GB |
| 65B | 35GB |
| 70B | 38GB |
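For model sizes that fall between the rows above, the listed figures happen to fit a straight line of about 0.5 GB per billion parameters plus 2.5 GB. That fit is an observation about this table, not a formula given by the project, and real usage may differ with data type, context length, and batch size. A hedged sketch:

```python
import math

# Figures copied from the table above: parameters (billions) -> GB.
TABLE_GB = {7: 6, 13: 9, 33: 19, 65: 35, 70: 38}


def estimate_vram_gb(params_billion):
    """Linear fit to the table rows (an assumption, not an official formula)."""
    return math.ceil(0.5 * params_billion + 2.5)


for size, listed in TABLE_GB.items():
    print(size, listed, estimate_vram_gb(size))
```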

📝 License

This project is licensed under the MIT License. See the LICENSE file for details.

📚 References

  1. Runpod Serverless
  2. vLLM
  3. Huggingface

🙏 Thanks

Special thanks to @Jorghi12 and @ashleykleynhans for helping out with this project.