This guide uses Nvidia's DeepOps project to deploy a Kubernetes cluster on existing nodes using Ansible. Detailed info about DeepOps can be found in its repository at https://github.com/NVIDIA/deepops.
This guide is NOT using OKE.
The images you use for management and worker nodes matter. The Kubernetes cluster in this guide will use Node Feature Discovery (NFD) for labeling the nodes automatically. If you use an image with OFED drivers installed for the management nodes, NFD will incorrectly label them as RDMA capable.
This guide assumes that you use the following images:
Management: Canonical-Ubuntu-20.04-2023.02.15-0 (OCI platform image - any Ubuntu 20.04 image without the OFED drivers should work)
Worker: Ubuntu-20-OFED-5.4-3.6.8.1-GPU-515-2023.01.10-0
1 - Deploy provisioning node, Kubernetes management and worker nodes.
Deploy the necessary nodes prior to following the steps in this guide. This guide is based on a single Kubernetes management node and 2 Kubernetes GPU worker nodes in a cluster network.
The easiest way to deploy and configure the nodes is using the HPC stack. v2.10.1 of the stack has an option to deploy a login node. You can find the copy of the stack in this repo (oci-hpc-clusternetwork-2.10.1.zip
).
You can use the bastion as the provisioner node, and the login node as the management node.
As a minimum, you will need:
- 1 provisioning node (use the bastion node created by the HPC stack)
- 1 management node (use the login node created by the HPC stack)
- 1 GPU worker node with RDMA NICs
2 - SSH into the provisioning (bastion) node, and clone the DeepOps repository.
git clone https://github.com/NVIDIA/deepops.git
3 - Set up your provisioning node.
This will install Ansible and other software on the provisioning machine which will be used to deploy all other software to the cluster. For more information on Ansible and why we use it, consult the Ansible Guide.
cd deepops
./scripts/setup.sh
4 - Create and edit the Ansible inventory.
Ansible uses an inventory which outlines the servers in your cluster. The setup script from the previous step will copy an example inventory configuration to the config
directory.
Edit the inventory file in config/inventory
. An example inventory file with the Kubernetes related parts:
#
# Server Inventory File
#
# Uncomment and change the IP addresses in this file to match your environment
# Define per-group or per-host configuration in group_vars/*.yml
######
# ALL NODES
# NOTE: Use existing hostnames here, DeepOps will configure server hostnames to match these values
######
[all]
mgmt01 ansible_host=10.0.0.76
#mgmt02 ansible_host=10.0.0.2
#mgmt03 ansible_host=10.0.0.3
#login01 ansible_host=10.0.1.1
gpu01 ansible_host=10.0.0.201
gpu02 ansible_host=10.0.0.189
######
# KUBERNETES
######
[kube-master]
mgmt01
#mgmt02
#mgmt03
# Odd number of nodes required
[etcd]
mgmt01
#mgmt02
#mgmt03
# Also add mgmt/master nodes here if they will run non-control plane jobs
[kube-node]
gpu01
gpu02
[k8s-cluster:children]
kube-master
kube-node
5 - Change gpu_operator_preinstalled_nvidia_software
to false
in deepops/config/group_vars/k8s-cluster.yml
. This option will use the existing GPU drivers on the node instead of reinstalling it.
# Install NVIDIA Driver and nvidia-docker on node (true), not as part of GPU Operator (driver container, nvidia-toolkit) (false)
gpu_operator_preinstalled_nvidia_software: false
6 - Verify the configuration.
ansible all -m raw -a "hostname"
7 - Install Kubernetes using Ansible and Kubespray.
ansible-playbook -l k8s-cluster playbooks/k8s-cluster.yml
8 - Verify that the Kubernetes cluster is running.
kubectl get nodes
NAME STATUS ROLES AGE VERSION
gpu01 Ready <none> 122m v1.23.7
gpu02 Ready <none> 122m v1.23.7
mgmt01 Ready control-plane,master 132m v1.23.7
9 - Deploy the config map for Mellanox RDMA Shared Device Plugin.
Save the following file as configmap.yaml
and deploy it using kubectl apply -f configmap.yaml
.
apiVersion: v1
data:
config.json: |
{
"periodicUpdateInterval": 300,
"configList": [
{
"resourceName": "roce",
"rdmaHcaMax": 16,
"selectors": {
"drivers": [
"mlx5_core"
]
}
}
]
}
kind: ConfigMap
metadata:
name: rdma-devices
namespace: kube-system
Check that the config map is there:
kubectl get configmap -n kube-system | grep rdma-devices
rdma-devices 1 98m
10 - Deploy the Mellanox RDMA Shared Device Plugin daemonset.
Save the following file as rdma-ds.yaml
and deploy it using kubectl apply -f rdma-ds.yaml
.
apiVersion: apps/v1
kind: DaemonSet
metadata:
name: rdma-shared-dp-ds
namespace: kube-system
spec:
selector:
matchLabels:
name: rdma-shared-dp-ds
template:
metadata:
labels:
name: rdma-shared-dp-ds
spec:
hostNetwork: true
priorityClassName: system-node-critical
nodeSelector:
feature.node.kubernetes.io/custom-rdma.capable: "true"
containers:
- image: mellanox/k8s-rdma-shared-dev-plugin
name: k8s-rdma-shared-dp-ds
imagePullPolicy: IfNotPresent
securityContext:
privileged: true
volumeMounts:
- name: device-plugin
mountPath: /var/lib/kubelet/
- name: config
mountPath: /k8s-rdma-shared-dev-plugin
- name: devs
mountPath: /dev/
volumes:
- name: device-plugin
hostPath:
path: /var/lib/kubelet/
- name: config
configMap:
name: rdma-devices
items:
- key: config.json
path: config.json
- name: devs
hostPath:
path: /dev/
Check that the Daemonset is deployed correctly only on the nodes that has RDMA NICs (in our example, GPU nodes). You should see a pod running in each GPU node.
kubectl get pods -n kube-system -o wide | grep rdma-shared-dp-ds
rdma-shared-dp-ds-5sk7t 1/1 Running 0 94m 10.0.0.189 gpu02 <none> <none>
rdma-shared-dp-ds-lzjgc 1/1 Running 0 94m 10.0.0.201 gpu01 <none> <none>
11 - Now let's test running ib_write_bw
between two pods. Save the following file as rdma-test.yaml
and deploy it.
NOTE: When creating a pod spec, make sure that you're setting hostNetwork: true
and request rdma/roce: 1
as a resource. The following example spec has both options set.
kubectl apply -f rdma-test.yaml
apiVersion: v1
kind: Pod
metadata:
name: rdma-test-pod-1
spec:
hostNetwork: true
dnsPolicy: ClusterFirstWithHostNet
restartPolicy: OnFailure
containers:
- image: oguzpastirmaci/mofed-perftest:5.4-3.6.8.1-ubuntu20.04-amd64
name: mofed-test-ctr
securityContext:
capabilities:
add: [ "IPC_LOCK" ]
resources:
limits:
rdma/roce: 1
command:
- sh
- -c
- |
ls -l /dev/infiniband /sys/class/net
sleep 1000000
---
apiVersion: v1
kind: Pod
metadata:
name: rdma-test-pod-2
spec:
hostNetwork: true
dnsPolicy: ClusterFirstWithHostNet
restartPolicy: OnFailure
containers:
- image: oguzpastirmaci/mofed-perftest:5.4-3.6.8.1-ubuntu20.04-amd64
name: mofed-test-ctr
securityContext:
capabilities:
add: [ "IPC_LOCK" ]
resources:
limits:
rdma/roce: 1
command:
- sh
- -c
- |
ls -l /dev/infiniband /sys/class/net
sleep 1000000
12 - Wait until both pods are in Running state.
kubectl get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
rdma-test-pod-1 1/1 Running 0 49m 10.0.0.125 gpu02 <none> <none>
rdma-test-pod-2 1/1 Running 0 49m 10.0.0.70 gpu01 <none> <none>
13 - After the pods are running, open two terminals and exec into the pods.
kubectl exec -it rdma-test-pod-1 -- bash
kubectl exec -it rdma-test-pod-2 -- bash
14 - Confirm that you see the RDMA interfaces inside the pods.
ip ad | grep rdma
2: rdma4: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 4220 qdisc mq state UP group default qlen 20000
inet 192.168.4.201/16 brd 192.168.255.255 scope global rdma4
3: rdma5: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 4220 qdisc mq state UP group default qlen 20000
inet 192.168.5.201/16 brd 192.168.255.255 scope global rdma5
.....
15 - Run a basic ib_write_bw
test between pods. Make sure the RDMA interface you use matches the RDMA IP (192.168.x.x). You can get the device/interface matching with the ibdev2netdev
command.
In rdma-test-pod-1
run:
ib_write_bw -d mlx5_3 -a -F --report_gbits
In rdma-test-pod-2
run:
ib_write_bw -F -d mlx5_3 <IP of mlx5_3 in rdma-test-pod-1> -D 10 --cpu_util --report_gbits
You should see a result similar to below:
************************************
* Waiting for client to connect... *
************************************
---------------------------------------------------------------------------------------
RDMA_Write BW Test
Dual-port : OFF Device : mlx5_3
Number of qps : 1 Transport type : IB
Connection type : RC Using SRQ : OFF
CQ Moderation : 100
Mtu : 4096[B]
Link type : Ethernet
GID index : 2
Max inline data : 0[B]
rdma_cm QPs : OFF
Data ex. method : Ethernet
---------------------------------------------------------------------------------------
local address: LID 0000 QPN 0x10107 PSN 0xb5554 RKey 0x04ad6e VAddr 0x007f8389beb000
GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:07:201
remote address: LID 0000 QPN 0x10107 PSN 0x92e7fa RKey 0x047c3b VAddr 0x007fd6f366c000
GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:07:189
---------------------------------------------------------------------------------------
#bytes #iterations BW peak[MB/sec] BW average[MB/sec] MsgRate[Mpps]
65536 1101500 0.00 96.25 0.183585
---------------------------------------------------------------------------------------
To add K8s nodes, modify the config/inventory
file to include the new nodes under [all]
. Then list the nodes as relevant under the [kube-master]
, [etcd]
, and [kube-node]
sections. For example, if adding a new master node, list it under kube-master and etcd. A new worker node would go under kube-node.
Then run the Kubespray scale.yml
playbook...
ansible-playbook -l k8s-cluster submodules/kubespray/scale.yml
More information on this topic may be found in the Kubespray docs.
Removing nodes can be performed with Kubespray's remove-node.yml
playbook and supplying the node names as extra vars...
ansible-playbook submodules/kubespray/remove-node.yml --extra-vars "node=nodename0,nodename1"
This will drain nodename0
& nodename1
, stop Kubernetes services, delete certificates, and finally execute the kubectl command to delete the nodes.
More information on this topic may be found in the Kubespray docs.
DeepOps is largely idempotent, but in some cases, it is helpful to completely reset a cluster. KubeSpray provides a best-effort attempt at this through a playbook. The script below is recommended to be run twice as some components may not completely uninstall due to time-outs/failed dependent conditions.
ansible-playbook submodules/kubespray/reset.yml
Refer to the Kubespray Upgrade docs for instructions on how to upgrade the cluster.
Deploy Prometheus and Grafana to monitor Kubernetes and cluster nodes:
./scripts/k8s/deploy_monitoring.sh
Available Flags:
-h This message.
-p Print monitoring URLs.
-d Delete monitoring namespace and crds. Note, this may delete PVs storing prometheus metrics.
-x Disable persistent data, this deploys Prometheus with no PV backing resulting in a loss of data across reboots.
-w Wait and poll the grafana/prometheus/alertmanager URLs until they properly return.
delete Legacy positional argument for delete. Same as -d flag.
The services can be reached from the following addresses:
- Grafana: http://<kube-master>:30200
- Prometheus: http://<kube-master>:30500
- Alertmanager: http://<kube-master>:30400
DeepOps deploys monitoring services using the prometheus-operator project. For documentation on configuring and managing the monitoring services, please see the prometheus-operator user guides.