Llama2 on GKE with GPU

This Terraform project provisions a Google Kubernetes Engine (GKE) cluster with a GPU node pool and runs Llama2 from HuggingFace in JupyterHub on that cluster. Preemptible nodes keep the cost down.

Architecture

- GKE cluster with preemptible nodes (cost-effective)
- Dedicated GPU node pool with NVIDIA T4 GPUs for Llama2 inference
- JupyterHub for interactive model usage
- Custom Docker image with PyTorch, HuggingFace, and Llama2 support

Prerequisites

- Google Cloud SDK installed and configured
- Terraform 1.0.0 or later installed
- Access to a Google Cloud project with the necessary APIs enabled:
  - Kubernetes Engine API
  - Compute Engine API
  - Container Registry API
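
The prerequisites above can be checked from a terminal before running Terraform. The following is a minimal sketch, not part of the project itself: the function names are hypothetical, the version comparison assumes a `sort` that supports `-V` (GNU coreutils), and `kubectl` and `helm` are included in the check only because later steps call them.

```shell
# Succeeds when dotted version $1 is >= $2. Assumes `sort -V` is available.
version_ge() {
    [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# Fails with a message when a required CLI is missing or Terraform is too old.
check_prerequisites() {
    for tool in gcloud terraform kubectl helm; do
        command -v "$tool" >/dev/null 2>&1 || { echo "missing: $tool" >&2; return 1; }
    done
    # The first line of `terraform version` looks like "Terraform v1.5.7".
    tf_version=$(terraform version | head -n 1 | sed 's/^Terraform v//')
    version_ge "$tf_version" "1.0.0" || {
        echo "Terraform $tf_version is older than 1.0.0" >&2
        return 1
    }
}

# Enables the three APIs listed above. $1 is your project ID.
enable_required_apis() {
    gcloud services enable \
        container.googleapis.com \
        compute.googleapis.com \
        containerregistry.googleapis.com \
        --project "$1"
}
```

A typical call would be `check_prerequisites && enable_required_apis your-project-id`, where `your-project-id` is a placeholder.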

Setup

Prepare configuration

Copy the example variables file and edit as needed:

```
cp terraform.tfvars.example terraform.tfvars
# Edit terraform.tfvars with your project ID and any other customizations
```

Initialize Terraform
```
terraform init
```
Apply the configuration
```
terraform apply
```
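
The `kubectl` and `helm` commands in the following steps need credentials for the new cluster. If your kubeconfig does not already point at it, a sketch like the one below may help. The function name is hypothetical, and the cluster name, zone, and project ID are whatever your `terraform.tfvars` produced; a regional cluster would take `--region` instead of `--zone`.

```shell
# Writes kubeconfig credentials for the cluster so kubectl and helm can reach it.
# $1 = cluster name, $2 = zone, $3 = project ID (all supplied by you).
fetch_cluster_credentials() {
    gcloud container clusters get-credentials "$1" --zone "$2" --project "$3"
}
```

For example: `fetch_cluster_credentials my-cluster us-central1-a my-project`, with all three values as placeholders.
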
Install NVIDIA drivers

After the cluster is created, apply the NVIDIA driver installer:
```
kubectl apply -f k8s/nvidia-driver-installer.yaml
```
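
To confirm that the drivers are working, you can check whether GPU nodes advertise an allocatable `nvidia.com/gpu` resource. This is a sketch that assumes the GPU nodes carry the standard GKE label `cloud.google.com/gke-accelerator`; the function name is hypothetical.

```shell
# Lists GPU nodes and the number of GPUs each one reports as allocatable.
# The backslash escapes the dot inside the "nvidia.com/gpu" resource name.
list_gpu_nodes() {
    kubectl get nodes -l cloud.google.com/gke-accelerator \
        -o 'custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu'
}
```

Because the GPU node pool scales to zero when idle, an empty result may simply mean that no GPU node is running yet.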

Install JupyterHub

```
helm repo add jupyterhub https://jupyterhub.github.io/helm-chart/
helm repo update
helm install jupyterhub jupyterhub/jupyterhub --values k8s/jupyter-values.yaml
```
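
The Helm release can take a few minutes to become ready. One way to wait for it, sketched here under the assumption that the chart creates its usual `hub` and `proxy` deployments and that the release was installed into the `default` namespace (the function name and the five-minute timeout are arbitrary choices):

```shell
# Blocks until the JupyterHub hub and proxy deployments have rolled out.
# $1 = namespace of the Helm release (defaults to "default").
wait_for_jupyterhub() {
    namespace="${1:-default}"
    kubectl rollout status deployment/hub --namespace "$namespace" --timeout=5m &&
    kubectl rollout status deployment/proxy --namespace "$namespace" --timeout=5m
}
```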

Access JupyterHub
```
kubectl port-forward service/proxy-public 8080:80
```
While the port-forward is running, open http://localhost:8080 in your browser.

Using Llama2

- Start a new Jupyter server with the "Llama2 GPU Environment" profile
- Open the example notebook or create a new one
- Follow the example code to run the Llama2 model with GPU acceleration
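
Before loading the model, it can be worth confirming from a terminal inside the Jupyter server that a GPU is visible. The sketch below assumes `nvidia-smi` is available in the notebook image, which this README does not state; the function name is hypothetical.

```shell
# Prints one line per visible GPU: its name and total memory.
gpu_summary() {
    if ! command -v nvidia-smi >/dev/null 2>&1; then
        echo "nvidia-smi not found; this server may not be running on a GPU node" >&2
        return 1
    fi
    nvidia-smi --query-gpu=name,memory.total --format=csv,noheader
}
```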

Clean Up

To delete all resources:

```
terraform destroy
```

Cost Optimization

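This deployment uses preemptible VMs and autoscaling to minimize costs. The GPU node pool scales to zero when no workload needs it, and preemptible instances are billed at a lower rate than standard VMs. In exchange, Compute Engine can reclaim a preemptible node at any time and always stops it within 24 hours, so save notebook work regularly.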
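
The scaling behavior described above can be observed by counting nodes per label. This is a sketch with a hypothetical helper; it assumes the standard GKE node labels `cloud.google.com/gke-accelerator` (GPU nodes) and `cloud.google.com/gke-preemptible=true` (preemptible nodes).

```shell
# Prints how many nodes match the label selector given as $1.
# kubectl reports "No resources found" on stderr, so an empty pool counts as 0.
count_nodes() {
    kubectl get nodes -l "$1" --no-headers 2>/dev/null | wc -l | tr -d ' '
}
```

`count_nodes cloud.google.com/gke-accelerator` should drop to zero some time after the last GPU notebook server stops, and `count_nodes cloud.google.com/gke-preemptible=true` shows how many preemptible nodes are currently running.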

idvoretskyi/terraform-gke-with-llama2