
namudhaj

Serving LLMs (Falcon 7B Instruct, Llama 2 70B, ...) with GPUs on Google Kubernetes Engine (GKE)

Authors

  • Sepehr Ahmadi
  • Zafar Mahmood

Motivation

Deploying a large language model on Kubernetes with GPUs enables quick, iterative development of NLP applications and prompt engineering.

With this motivation, we started this project as part of the ECE1779 Introduction to Cloud Computing course.

Installation steps:

  1. Make sure you have enabled the Google Kubernetes Engine API.

  2. Also enable the Container File System API, which GKE uses for image streaming.
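
Both APIs can also be enabled from the command line; a minimal sketch (assumes the gcloud CLI is installed and your project is already configured):

$ gcloud services enable container.googleapis.com containerfilesystem.googleapis.com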

  3. If you want to use the Llama 2 models, request access from Meta, as the Llama models are governed by Meta's license. Follow the details here.

  4. Create a HuggingFace token.

  5. Set the default environment variables:

$ source ./gcs_scripts/set_environment.sh

Updated property [core/project].
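
For reference, gcs_scripts/set_environment.sh presumably looks something like the sketch below; variable names other than REGION (used later with --region=${REGION}) are assumptions, and the values come from the outputs shown in this walkthrough:

# set_environment.sh (sketch, not the actual script contents)
export PROJECT_ID=ece1779project   # project ID seen in the cluster URLs below
export REGION=us-central1          # region used when creating the cluster
gcloud config set project ${PROJECT_ID}   # prints "Updated property [core/project]."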
  6. Create a cluster named llm-cluster:
$ ./gcs_scripts/create_cluster.sh

Creating cluster llm-cluster in us-central1... Cluster is being deployed...
Creating cluster llm-cluster in us-central1... Cluster is being health-checked (master is healthy)...done.
Created [https://container.googleapis.com/v1/projects/ece1779project/zones/us-central1/clusters/llm-cluster].
To inspect the contents of your cluster, go to: https://console.cloud.google.com/kubernetes/workload_/gcloud/us-central1/llm-cluster?project=ece1779project
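
gcs_scripts/create_cluster.sh presumably wraps a single gcloud call; a sketch under that assumption (the exact flags are guesses, not the script's actual contents):

# create_cluster.sh (sketch); --enable-image-streaming relies on the
# Container File System API enabled in step 2.
gcloud container clusters create llm-cluster \
    --region ${REGION} \
    --machine-type e2-standard-4 \
    --num-nodes 1 \
    --enable-image-streaming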
  7. Create a GPU node pool:
$ ./gcs_scripts/create_gpu_node_pool.sh

Creating node pool g2-standard-24...done.
Created [https://container.googleapis.com/v1/projects/ece1779project/zones/us-central1/clusters/llm-cluster/nodePools/g2-standard-24].
NAME            MACHINE_TYPE    DISK_SIZE_GB  NODE_VERSION
g2-standard-24  g2-standard-24  100           1.27.3-gke.100
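
Similarly, gcs_scripts/create_gpu_node_pool.sh is presumably a single gcloud call; a sketch (the flags are assumptions, but the machine type matches the output above, and each g2-standard-24 VM comes with 2 NVIDIA L4 GPUs):

# create_gpu_node_pool.sh (sketch, not the actual script contents)
gcloud container node-pools create g2-standard-24 \
    --cluster llm-cluster \
    --region ${REGION} \
    --machine-type g2-standard-24 \
    --accelerator type=nvidia-l4,count=2 \
    --num-nodes 1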
  8. Install the GKE Auth plugin:
$ gcloud components install gke-gcloud-auth-plugin
  9. Configure kubectl:
$ gcloud container clusters get-credentials llm-cluster --region=${REGION}

Fetching cluster endpoint and auth data.
kubeconfig entry generated for llm-cluster.
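To confirm that kubectl now points at the new cluster and that the GPU nodes advertise their accelerators, standard kubectl commands work (these are not part of the repo scripts):

$ kubectl get nodes
$ kubectl describe nodes | grep -i "nvidia.com/gpu"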
  10. Set the HuggingFace token as an environment variable, replacing HUGGING_FACE_TOKEN with the token you created in step 4:
$ export HF_TOKEN=HUGGING_FACE_TOKEN
  11. Create a Kubernetes secret for the HuggingFace token:
$ ./gcs_scripts/create_hf_k8s_secret.sh
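
The script presumably generates the manifest applied in the next step; a sketch of what it likely runs (the secret name llm-cluster matches the output below, while the key name hf_api_token is an assumption):

# create_hf_k8s_secret.sh (sketch, not the actual script contents)
kubectl create secret generic llm-cluster \
    --from-literal=hf_api_token=${HF_TOKEN} \
    --dry-run=client -o yaml > configs/hf-secret.yaml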
  12. Apply the manifest:
$ kubectl apply -f configs/hf-secret.yaml

secret/llm-cluster created
  13. Build, tag, and push the Docker image. Here is the DockerHub repository:
$ ./build.sh
$ docker push 
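
build.sh presumably wraps the usual docker build sequence; a sketch (IMAGE is a placeholder for the DockerHub repository linked above, not a real name):

# build.sh (sketch, not the actual script contents)
docker build -t IMAGE:latest .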
  14. Create the text generation inference (TGI) Kubernetes Deployment (with replicas = 1):
$ kubectl apply -f configs/text-generation-inference.yaml

deployment.apps/llm created
service/llm-service created
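
For orientation, configs/text-generation-inference.yaml plausibly looks like the sketch below. The Deployment/Service names, replica count, secret name, and GPU count match earlier steps; the container image, model ID, ports, labels, and secret key are assumptions:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm                          # matches "deployment.apps/llm created"
spec:
  replicas: 1
  selector:
    matchLabels:
      app: llm
  template:
    metadata:
      labels:
        app: llm
    spec:
      containers:
      - name: llm
        image: ghcr.io/huggingface/text-generation-inference:1.1.0  # assumption
        env:
        - name: MODEL_ID
          value: tiiuae/falcon-7b-instruct     # assumption: one of the served models
        - name: HUGGING_FACE_HUB_TOKEN         # token read from the secret
          valueFrom:
            secretKeyRef:
              name: llm-cluster                # secret created in steps 11-12
              key: hf_api_token                # assumption
        resources:
          limits:
            nvidia.com/gpu: 2                  # both L4 GPUs on a g2-standard-24 node
      nodeSelector:
        cloud.google.com/gke-accelerator: nvidia-l4   # assumption: target the GPU pool
---
apiVersion: v1
kind: Service
metadata:
  name: llm-service                  # matches "service/llm-service created"
spec:
  selector:
    app: llm
  ports:
  - port: 80
    targetPort: 80                   # TGI listens on port 80 by default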
  15. Deploy the frontend Gradio app (both Deployment and Service):
$ kubectl apply -f configs/gradio-tgi.yaml
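
Once the Gradio pods are running, the UI can be reached locally with a standard port-forward (the Service name and port here are assumptions; check configs/gradio-tgi.yaml for the actual values):

$ kubectl port-forward service/gradio 8080:8080

Then open http://localhost:8080 in a browser and start prompting the model.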