This guide accompanies the KubeCon AI day lightning talk on deploying a vLLM server using the NVIDIA DRA driver.
- Choose the NVIDIA PyTorch image version that is compatible with the following driver and CUDA library versions.
  - NVIDIA driver version: 530.30.02
  - CUDA library version: 12.1
  - Torch version: torch == 2.1.2
  - Python version: 3.9+
According to the compatibility matrix, the supported NVIDIA PyTorch image is `nvcr.io/nvidia/pytorch:23.07-py3`.
The corresponding Dockerfile should start with the following:
# Use the NVIDIA PyTorch image compatible with NVIDIA driver 530.30.02 and CUDA 12.1
FROM nvcr.io/nvidia/pytorch:23.07-py3
- Configure the `LD_LIBRARY_PATH` environment variable to include the mount path of the host CUDA libraries.
# Configure environment variable for CUDA libraries
ENV LD_LIBRARY_PATH=/usr/lib/x86_64-linux-gnu:$LD_LIBRARY_PATH
- Install the necessary Python libraries and the vLLM package.
# Upgrade pip and setuptools
RUN python3 -m pip install --upgrade pip && \
python3 -m pip install --upgrade setuptools
# Install vLLM libraries
RUN python3 -m pip install vllm
- Configure a default model and allow it to be overridden from outside the container.
# Define the MODEL_NAME environment variable, with a default value
ENV MODEL_NAME=facebook/opt-125m
- Start the vLLM OpenAI-compatible API server.
# Command to start the vLLM API server
CMD python -m vllm.entrypoints.openai.api_server --model ${MODEL_NAME}
- Remove the original GPU resource limit from the deployment; the key is usually `nvidia.com/gpu`, or for example `nvidia.com/mig-2g.10gb` for a MIG slice.
resources:
  limits:
    nvidia.com/gpu: 1
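After switching to DRA, the container no longer requests the GPU through resources.limits; instead it references a named resource claim. The snippet below is taken from the resources.claims section of the full deployment later in this guide and only illustrates the contrast.
# With DRA, the GPU limit is replaced by a reference to a named claim
resources:
  claims:
  - name: gpu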
- Define a ResourceClaim for the NVIDIA GPU to be consumed by the deployment (a ResourceClaimTemplate alternative is sketched after the manifest below).
apiVersion: resource.k8s.io/v1alpha2
kind: ResourceClaim
metadata:
  namespace: gpu-test1
  name: gpu.nvidia.com
spec:
  resourceClassName: gpu.nvidia.com
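If you prefer Kubernetes to create a dedicated claim per pod, the same DRA API (resource.k8s.io/v1alpha2) also provides a ResourceClaimTemplate. The manifest below is only a minimal sketch of that alternative, reusing the name, namespace, and resource class from the ResourceClaim above; the deployment would then reference it via resourceClaimTemplateName instead of resourceClaimName.
apiVersion: resource.k8s.io/v1alpha2
kind: ResourceClaimTemplate
metadata:
  namespace: gpu-test1
  name: gpu.nvidia.com
spec:
  # Template for the ResourceClaim created for each pod that references it
  spec:
    resourceClassName: gpu.nvidia.com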
- Instantiate the previously defined ResourceClaim under `resourceClaims` in the deployment's pod spec, and reference it in the container's `resources.claims` field.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm
  labels:
    app: vllm-1gpu
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm
  template:
    metadata:
      labels:
        app: vllm
    spec:
      containers:
      - name: vllm-container
        image: quay.io/chenw615/vllm_dra:latest
        imagePullPolicy: IfNotPresent
        command: ["python3", "-m", "vllm.entrypoints.openai.api_server", "--model", "$(MODEL_NAME)"]
        ports:
        - containerPort: 8000
        env:
        - name: HUGGING_FACE_HUB_TOKEN
          valueFrom:
            secretKeyRef:
              name: huggingface-secret
              key: HF_TOKEN
        - name: MODEL_NAME
          value: "facebook/opt-125m"
        volumeMounts:
        - name: cache-volume
          mountPath: /root/.cache/huggingface
        resources:
          claims:
          - name: gpu
      resourceClaims:
      - name: gpu
        source:
          resourceClaimName: gpu.nvidia.com
      volumes:
      - name: cache-volume
        persistentVolumeClaim:
          claimName: huggingface-cache-pvc
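A later step forwards `svc/vllm` to localhost, so a Service exposing the deployment on port 8000 is assumed to exist (it may already be defined in the repo's manifests). The manifest below is only a minimal sketch of such a Service, reusing the `app: vllm` label from the deployment above.
apiVersion: v1
kind: Service
metadata:
  name: vllm
spec:
  # Select the vLLM pods created by the deployment above
  selector:
    app: vllm
  ports:
  - port: 8000
    targetPort: 8000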
- Follow the NVIDIA k8s-dra-driver tutorial to set up your node environment and start a kind cluster. The node environment setup is skipped here.
- Set up a `kind` cluster and install the NVIDIA k8s-dra-driver.
First, clone the NVIDIA k8s-dra-driver repo.
git clone https://github.com/NVIDIA/k8s-dra-driver.git
cd k8s-dra-driver
Then, create a `kind` cluster.
./demo/clusters/kind/create-cluster.sh
From there, we run their script to build the NVIDIA GPU resource driver image and make it available to the `kind` cluster.
./demo/clusters/kind/build-dra-driver.sh
Then, we install the NVIDIA GPU DRA driver.
./demo/clusters/kind/install-dra-driver.sh
- Clone this repo.
git clone https://github.com/wangchen615/vLLM-DRA
cd vLLM-DRA
- Deploy `vllm_cache.yaml` to create the secret token used to download models from HuggingFace, and the persistent volume and persistent volume claim that cache models on localhost.
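The exact contents of `vllm_cache.yaml` are in this repo; as a rough sketch, it is assumed to define at least a Secret and a PersistentVolumeClaim matching the names the deployment references (`huggingface-secret` with key `HF_TOKEN`, and `huggingface-cache-pvc`), plus the backing PersistentVolume, which is omitted here. The storage size below is an assumption.
apiVersion: v1
kind: Secret
metadata:
  name: huggingface-secret
type: Opaque
data:
  HF_TOKEN: <hg_secret_token>   # base64-encoded HuggingFace token (see next step)
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: huggingface-cache-pvc
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 20Gi   # assumed size; adjust to fit the models you plan to cache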
- Replace `<hg_secret_token>` with the base64-encoded HuggingFace token.
echo -n 'your_hg_token' | base64
- Create the cache
kubectl create -f vllm_cache.yaml
- Deploy `vllm_dra_1gpu.yaml`.
kubectl create -f vllm_dra_1gpu.yaml
- Forward the port to localhost.
kubectl port-forward svc/vllm 8000:8000 >/dev/null 2>&1 &
- Try the following query.
curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "facebook/opt-125m",
        "prompt": "San Francisco is a",
        "max_tokens": 7,
        "temperature": 0
    }'
This interactive tutorial walks you through using DRA to dynamically create a MIG slice and deploy a vLLM server on OpenShift.
- Clone the repo.
- Run and follow the steps interactively.
./demo.sh