# Deploy the Kernel Module Management Operator
$ kubectl apply -k https://github.com/kubernetes-sigs/kernel-module-management/config/default
# Deploy the Habana AI Operator
$ git clone https://github.com/fabiendupont/habana-ai-operator.git && cd habana-ai-operator
$ make deploy
# Deploy the NodeFeatureDiscovery Operator
$ kubectl apply -k https://github.com/kubernetes-sigs/node-feature-discovery-operator/config/default
# Deploy a NodeFeatureDiscovery instance with the extra namespace `habana.ai`
$ kubectl apply -f hack/openshift/nfd-instance.yaml
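Once the NodeFeatureDiscovery instance is running, nodes should carry labels in the extra `habana.ai` namespace. A quick sanity check is to filter the node labels; the snippet below runs on sample data (the label names are illustrative, not from a live cluster) — on a real cluster, list them with `kubectl get nodes --show-labels`:

```shell
# Sample node labels as NFD might emit them (illustrative data only;
# 1da3 is the Habana Labs PCI vendor ID)
labels="kubernetes.io/os=linux,habana.ai/gaudi.present=true,feature.node.kubernetes.io/pci-1da3.present=true"

# Keep only the labels in the habana.ai extra namespace
echo "$labels" | tr ',' '\n' | grep '^habana\.ai/'
```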
# Create a sample DeviceConfig that targets all DL1 nodes.
$ kubectl apply -f config/samples/habana.ai_v1alpha1_deviceconfig.yaml
# Wait until all Habana AI components are healthy
$ kubectl get -n habana-ai-operator all
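Instead of polling `kubectl get` manually, the wait can be scripted with `kubectl wait`. A minimal sketch, assuming `kubectl` access to the cluster (the namespace default and timeout are illustrative):

```shell
# Sketch of a readiness gate: block until every Deployment in the operator
# namespace reports the Available condition (timeout is illustrative)
wait_for_habana_ai() {
  kubectl wait deployment --all \
    --namespace "${1:-habana-ai-operator}" \
    --for=condition=Available \
    --timeout=300s
}
# wait_for_habana_ai    # uncomment to run against a live cluster
```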
# Run a sample workload pod
$ cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: habana-ai-demo
  namespace: habana-ai-operator
spec:
  restartPolicy: OnFailure
  containers:
  - image: vault.habana.ai/gaudi-docker/1.6.1/ubuntu20.04/habanalabs/tensorflow-installer-tf-cpu-2.9.1:latest
    imagePullPolicy: IfNotPresent
    name: habana-ai-base-container
    command:
    - "hl-smi"
    args:
    - "-L"
    resources:
      limits:
        habana.ai/gaudi: "1"
    securityContext:
      capabilities:
        add:
        - "SYS_RAWIO"
EOF
$ kubectl logs -n habana-ai-operator pod/habana-ai-demo
Kubernetes provides access to special hardware resources such as Habana AI accelerators, NICs, and other devices through the device plugin framework.
However, configuring and managing Kubernetes nodes with these hardware resources requires configuration of multiple software components, such as drivers, container runtimes, device plugins and other libraries, which is time consuming and error prone.
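Once the device plugin is in place, a workload consumes the accelerators like any other extended resource. For instance, with the resource name used by the sample workload in this guide, the relevant container spec fragment looks like:

```yaml
# Fragment of a container spec: request one Gaudi accelerator from the
# habana.ai/gaudi extended resource advertised by the device plugin
resources:
  limits:
    habana.ai/gaudi: "1"
```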
The Habana AI Operator works in tandem with the Kernel Module Management Operator to automate the management of all the Habana AI software components needed to provision AI accelerators within Kubernetes, from the drivers to the respective monitoring metrics.
The components managed by the operator are:
For a detailed description of the design of the Habana AI Operator, see this doc.
Use the Habana AI Operator, along with the Kernel Module Management Operator, in your OpenShift cluster to automatically provision and manage different AI accelerator configurations per node group.
The following guide leverages the automatically generated container images of:
- the Habana AI Operator
- the operator OLM bundle
- the OLM CatalogSource
# Given an OpenShift cluster with Habana AI powered nodes
$ oc get clusterversions.config.openshift.io
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.11.0 True False 30h Cluster version is 4.11.0
# Deploy the Kernel Module Management Operator
$ oc apply -k https://github.com/kubernetes-sigs/kernel-module-management/config/default
# Clone this repository
$ git clone https://github.com/fabiendupont/habana-ai-operator.git && cd habana-ai-operator
# Enable alternative firmware path on worker nodes
$ oc apply -f hack/openshift/machineconfig-firmware-path-worker.yaml
# When using Single Node OpenShift (SNO), the node carries both the master and worker roles, but when looking at the Machine Config Pools, the node belongs only to master.
# Running the following command is required only when Habana Labs cards are available on master Machine Config Pool nodes:
# Enable alternative firmware path on master nodes
$ oc apply -f hack/openshift/machineconfig-firmware-path-master.yaml
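The MachineConfigs applied above make the kernel search an additional firmware location. A minimal sketch of what such an object can look like — the name, kernel argument value, and path are assumptions for illustration; the actual content ships in `hack/openshift/`:

```yaml
# Illustrative sketch only; see hack/openshift/machineconfig-firmware-path-worker.yaml
# for the actual content shipped with this repository
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  name: 99-worker-firmware-path
  labels:
    machineconfiguration.openshift.io/role: worker
spec:
  kernelArguments:
  # firmware_class.path is the kernel module parameter that adds an extra
  # firmware search directory (the path here is an assumption)
  - firmware_class.path=/var/lib/firmware
```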
# Deploy the NodeFeatureDiscovery operator
$ oc apply -f hack/openshift/nfd-install.yaml
# Deploy a NodeFeatureDiscovery instance with the extra namespace `habana.ai`
$ oc apply -f hack/openshift/nfd-instance.yaml
# Deploy the Habana AI operator
$ make deploy
# Create a sample DeviceConfig targeting Habana AI nodes.
$ oc apply -f hack/openshift/deviceconfig.yaml
# Wait for all Habana AI components to be healthy
$ oc get -n habana-ai-operator all
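Once the components are healthy, the device plugin advertises the accelerators as allocatable node resources. A sketch of a check, assuming `oc` access to the cluster (the dot in the resource name must be escaped with `\.` in the JSONPath key):

```shell
# Sketch: print each node's name and allocatable habana.ai/gaudi count
list_allocatable_gaudi() {
  oc get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.allocatable.habana\.ai/gaudi}{"\n"}{end}'
}
# list_allocatable_gaudi    # uncomment to run against a live cluster
```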
# Verify the setup by running a sample workload pod
$ oc apply -f hack/openshift/sample-workload.yaml
# Check the workload logs
$ oc logs -n habana-ai-operator habana-ai-demo