
namudhaj

Serving LLMs (Falcon 7B Instruct, Llama 2 70B, ...) with GPUs on Google Kubernetes Engine (GKE)

Authors

  • Sepehr Ahmadi
  • Zafar Mahmood

Motivation

Deploying a large language model on Kubernetes with GPUs enables quick, iterative development of NLP applications and prompt engineering.

With this motivation, we started this project as part of the ECE1779 Introduction to Cloud Computing course.

Installation steps:

  1. Make sure you have enabled the Google Kubernetes Engine API.

  2. Also enable the Container File System API, which GKE uses for image streaming.
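
Both APIs can also be enabled from the command line; a minimal sketch (assumes the gcloud CLI is installed and your project is already configured):

$ gcloud services enable container.googleapis.com containerfilesystem.googleapis.com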

  3. If you want to use the Llama 2 models, request access from Meta, as the Llama models are governed by Meta's license. Follow the details here.

  4. Create a HuggingFace token.

  5. Set the default environment variables:

$ source ./gcs_scripts/set_environment.sh

Updated property [core/project].
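
For reference, gcs_scripts/set_environment.sh presumably looks something like the sketch below; variable names other than REGION (used later with --region=${REGION}) are assumptions, and the values come from the outputs shown in this walkthrough:

# set_environment.sh (sketch, not the actual script contents)
export PROJECT_ID=ece1779project   # project ID seen in the cluster URLs below
export REGION=us-central1          # region used when creating the cluster
gcloud config set project ${PROJECT_ID}   # prints "Updated property [core/project]."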
  6. Create a cluster named llm-cluster:
$ ./gcs_scripts/create_cluster.sh

Creating cluster llm-cluster in us-central1... Cluster is being deployed...
Creating cluster llm-cluster in us-central1... Cluster is being health-checked (master is healthy)...done.
Created [https://container.googleapis.com/v1/projects/ece1779project/zones/us-central1/clusters/llm-cluster].
To inspect the contents of your cluster, go to: https://console.cloud.google.com/kubernetes/workload_/gcloud/us-central1/llm-cluster?project=ece1779project
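
gcs_scripts/create_cluster.sh presumably wraps a single gcloud call; a sketch under that assumption (the exact flags are guesses, not the script's actual contents):

# create_cluster.sh (sketch); --enable-image-streaming relies on the
# Container File System API enabled in step 2.
gcloud container clusters create llm-cluster \
    --region ${REGION} \
    --machine-type e2-standard-4 \
    --num-nodes 1 \
    --enable-image-streaming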
  7. Create a GPU node pool:
$ ./gcs_scripts/create_gpu_node_pool.sh

Creating node pool g2-standard-24...done.
Created [https://container.googleapis.com/v1/projects/ece1779project/zones/us-central1/clusters/llm-cluster/nodePools/g2-standard-24].
NAME            MACHINE_TYPE    DISK_SIZE_GB  NODE_VERSION
g2-standard-24  g2-standard-24  100           1.27.3-gke.100
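
Similarly, gcs_scripts/create_gpu_node_pool.sh is presumably a single gcloud call; a sketch (the flags are assumptions, but the machine type matches the output above, and each g2-standard-24 VM comes with 2 NVIDIA L4 GPUs):

# create_gpu_node_pool.sh (sketch, not the actual script contents)
gcloud container node-pools create g2-standard-24 \
    --cluster llm-cluster \
    --region ${REGION} \
    --machine-type g2-standard-24 \
    --accelerator type=nvidia-l4,count=2 \
    --num-nodes 1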
  8. Install the GKE Auth plugin:
$ gcloud components install gke-gcloud-auth-plugin
  9. Configure kubectl:
$ gcloud container clusters get-credentials llm-cluster --region=${REGION}

Fetching cluster endpoint and auth data.
kubeconfig entry generated for llm-cluster.
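To confirm that kubectl now points at the new cluster and that the GPU nodes advertise their accelerators, standard kubectl commands work (these are not part of the repo scripts):

$ kubectl get nodes
$ kubectl describe nodes | grep -i "nvidia.com/gpu"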
  10. Set the HuggingFace token as an environment variable, replacing HUGGING_FACE_TOKEN with the token you created in step 4:
$ export HF_TOKEN=HUGGING_FACE_TOKEN
  11. Create a Kubernetes secret for the HuggingFace token:
$ ./gcs_scripts/create_hf_k8s_secret.sh
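
The script presumably generates the manifest applied in the next step; a sketch of what it likely runs (the secret name llm-cluster matches the output below, while the key name hf_api_token is an assumption):

# create_hf_k8s_secret.sh (sketch, not the actual script contents)
kubectl create secret generic llm-cluster \
    --from-literal=hf_api_token=${HF_TOKEN} \
    --dry-run=client -o yaml > configs/hf-secret.yaml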
  12. Apply the manifest:
$ kubectl apply -f configs/hf-secret.yaml

secret/llm-cluster created
  13. Build, tag, and push the Docker image. Here is the DockerHub repository:
$ ./build.sh
$ docker push 
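
build.sh presumably wraps the usual docker build sequence; a sketch (IMAGE is a placeholder for the DockerHub repository linked above, not a real name):

# build.sh (sketch, not the actual script contents)
docker build -t IMAGE:latest .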
  14. Create the text generation inference (TGI) Kubernetes Deployment (with replicas = 1):
$ kubectl apply -f configs/text-generation-inference.yaml

deployment.apps/llm created
service/llm-service created
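
For orientation, configs/text-generation-inference.yaml plausibly looks like the sketch below. The Deployment/Service names, replica count, secret name, and GPU count match earlier steps; the container image, model ID, ports, labels, and secret key are assumptions:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm                          # matches "deployment.apps/llm created"
spec:
  replicas: 1
  selector:
    matchLabels:
      app: llm
  template:
    metadata:
      labels:
        app: llm
    spec:
      containers:
      - name: llm
        image: ghcr.io/huggingface/text-generation-inference:1.1.0  # assumption
        env:
        - name: MODEL_ID
          value: tiiuae/falcon-7b-instruct     # assumption: one of the served models
        - name: HUGGING_FACE_HUB_TOKEN         # token read from the secret
          valueFrom:
            secretKeyRef:
              name: llm-cluster                # secret created in steps 11-12
              key: hf_api_token                # assumption
        resources:
          limits:
            nvidia.com/gpu: 2                  # both L4 GPUs on a g2-standard-24 node
      nodeSelector:
        cloud.google.com/gke-accelerator: nvidia-l4   # assumption: target the GPU pool
---
apiVersion: v1
kind: Service
metadata:
  name: llm-service                  # matches "service/llm-service created"
spec:
  selector:
    app: llm
  ports:
  - port: 80
    targetPort: 80                   # TGI listens on port 80 by default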
  15. Deploy the frontend Gradio app (both Deployment and Service):
$ kubectl apply -f configs/gradio-tgi.yaml
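
Once the Gradio pods are running, the UI can be reached locally with a standard port-forward (the Service name and port here are assumptions; check configs/gradio-tgi.yaml for the actual values):

$ kubectl port-forward service/gradio 8080:8080

Then open http://localhost:8080 in a browser and start prompting the model.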