Graphics Processing Units (GPUs) were originally invented to let application developers program 3D graphics accelerators to render photorealistic images in real time. The key insight is that GPUs accelerate matrix and vector math operations (dot products, cross products, and matrix multiplies). These operations turn out to be used in many applications beyond 3D graphics, including high performance computing and machine learning. As a result, software libraries (e.g. CUDA) were developed to let non-graphics, general purpose computing applications take advantage of GPU hardware.
GPUs provide a programmable graphics pipeline for realistic image rendering.
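As a quick illustration of those operations, each is a one-liner in TensorFlow (installed later in this guide) and runs on the GPU automatically when one is present. A minimal sketch:

```python
import tensorflow as tf

u = tf.constant([1.0, 2.0, 3.0])
v = tf.constant([4.0, 5.0, 6.0])

print(tf.tensordot(u, v, axes=1))                      # dot product
print(tf.linalg.cross(u, v))                           # cross product
print(tf.linalg.matmul(tf.eye(3), tf.linalg.diag(v)))  # matrix multiply
```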
The following NVIDIA® software is required for GPU support.
- NVIDIA® GPU drivers version 450.80.02 or higher.
- CUDA® Toolkit 11.2.
- cuDNN SDK 8.1.0.
- (Optional) TensorRT to improve latency and throughput for inference.
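Once TensorFlow is installed (see below), one way to confirm which CUDA and cuDNN versions its build expects is `tf.sysconfig.get_build_info()`. A minimal sketch; these keys are only populated in GPU-enabled builds:

```python
import tensorflow as tf

# Compare these against the toolkit and SDK versions listed above.
info = tf.sysconfig.get_build_info()
print("CUDA:", info.get("cuda_version"), "cuDNN:", info.get("cudnn_version"))
```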
The easy way to install the NVIDIA stack is to use the Ansible-based provisioning.
Once the stack is built, running the nvidia-smi utility will cause the kernel modules to load and will verify the driver installation.
nvidia-smi
Example output:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.47.03    Driver Version: 510.47.03    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            Off  | 00000000:00:1E.0 Off |                    0 |
| N/A   37C    P0    26W /  70W |      0MiB / 15360MiB |     10%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
Now the system should be ready to run a GPU workload. Run a simple test to validate the software stack.
Create a Python environment and install TensorFlow.
python3 -m venv venv
source venv/bin/activate
pip install pip tensorflow -U
Run a simple TensorFlow test to confirm a GPU device is found.
python
>>> import tensorflow as tf
>>> tf.config.list_logical_devices()
Sample output:
Created device /job:localhost/replica:0/task:0/device:GPU:0 with 14644 MB
memory: -> device: 0, name: Tesla T4, pci bus id: 0001:00:00.0, compute capability: 7.5
[LogicalDevice(name='/device:CPU:0', device_type='CPU'),
LogicalDevice(name='/device:GPU:0', device_type='GPU')]
Run the script to test the TensorFlow devices.
python src/tf-test.py
Compare the CPU vs. GPU elapsed time in the output.
[PhysicalDevice(name='/physical_device:CPU:0', device_type='CPU'), PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
Matrix Multiply Elapsed Time: {'CPU': 6.495161056518555, 'GPU': 0.9890825748443604}
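The script itself ships in the accompanying repo and is not reproduced here; as a rough sketch, a CPU-vs-GPU timing test of this kind can be as simple as the following (the matrix size and timing details are assumptions, not the actual file contents):

```python
import time
import tensorflow as tf

print(tf.config.list_physical_devices())

def timed_matmul(device, n=8192):
    """Run an n x n matrix multiply pinned to `device` and return seconds."""
    with tf.device(device):
        a = tf.random.uniform((n, n))
        b = tf.random.uniform((n, n))
        start = time.time()
        c = tf.linalg.matmul(a, b)
        _ = c.numpy()  # force execution before stopping the clock
    return time.time() - start

print("Matrix Multiply Elapsed Time:",
      {"CPU": timed_matmul("/CPU:0"), "GPU": timed_matmul("/GPU:0")})
```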
4.2.1. For RHEL 8.x, install the NVIDIA container toolkit.
4.2.2. For RHEL 9.x, follow this blog post to install the NVIDIA container toolkit.
The easy way to install the NVIDIA stack is to use the Ansible-based provisioning.
Wait for all the pods to reach a Completed or Running status. This could take several minutes.
oc get pods -n nvidia-gpu-operator
The daemonset pods will build a driver for each node with a GPU.
oc logs nvidia-driver-daemonset-410.84.202204112301-0-gf4t4 -n nvidia-gpu-operator nvidia-driver-ctr --follow
Tue May 17 19:41:23 UTC 2022 Waiting for openshift-driver-toolkit-ctr container to build the precompiled driver ...
Check the logs from one of the nvidia-cuda-validator pods.
oc logs -n nvidia-gpu-operator nvidia-cuda-validator-qpqcg
cuda workload validation is successful
Create a project as a cluster-admin user, then create a Jupyter/TensorFlow application and expose its service as an OpenShift route.
APP_NAME=tensorflow
oc new-project gputest
oc new-app --name=${APP_NAME} docker.io/tensorflow/tensorflow:latest-gpu-jupyter
oc expose service/tensorflow
ROUTE=http://$(oc get route ${APP_NAME} --template={{.spec.host}})
Dump the logs of the TensorFlow pod to obtain the Jupyter Lab token.
POD=$(oc get pods --selector=deployment=tensorflow --output=custom-columns=:.metadata.name --no-headers)
oc logs ${POD} | grep token
http://tensorflow-544f7d6b5b-m8sjg:8888/?token=7f5cfa6a9780fd77594c1e6a45ae88002169e98d87a38580
It may be necessary to set the nvidia.com/gpu=1 limit to ensure the pod gets scheduled on a node with a GPU.
oc set resources deployment/${APP_NAME} --requests=nvidia.com/gpu=1 --limits=nvidia.com/gpu=1
Connect to the TensorFlow pod and run a quick GPU test.
oc rsh ${POD} bash
tf-docker /tf > python
Python 3.8.10 (default, Mar 15 2022, 12:22:08)
[GCC 9.4.0] on linux
>>> import tensorflow as tf
>>> tf.config.list_physical_devices()
[PhysicalDevice(name='/physical_device:CPU:0', device_type='CPU'), PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
>>> exit()
$
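Listing the devices only proves detection. Optionally, to confirm the GPU can actually execute a kernel, run a small follow-up check in the same Python session (a hedged sketch, not part of the original test):

```python
import tensorflow as tf

# Pin a small matrix multiply to the GPU and confirm the result tensor
# was produced on the GPU device.
with tf.device("/GPU:0"):
    x = tf.random.uniform((1000, 1000))
    y = tf.linalg.matmul(x, x)

print(y.device)  # expect a string ending in 'device:GPU:0'
```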
Install the GPU console plugin dashboard by following the OpenShift documentation.
Create a new project for OpenDataHub.
Using the OpenShift web console, create an instance of the ODH operator in this project.
Create an ODH instance in your namespace.
Create the CUDA-enabled notebook image streams.
oc apply -f https://raw.githubusercontent.com/red-hat-data-services/odh-manifests/master/jupyterhub/notebook-images/overlays/additional/tensorflow-notebook-imagestream.yaml
ConfigMaps are used to set custom notebook resource limits such as the number of CPU cores, memory, and GPUs. This is necessary for the Jupyter pod to get scheduled on a GPU node.
Apply the following ConfigMap before launching the JupyterHub server.
oc apply -f src/jupyterhub-notebook-sizes.yml
From within Jupyter, clone the following repo:
These TensorFlow notebook examples should run:
- docs/site/en/tutorials/keras/classification.ipynb
- docs/site/en/tutorials/quickstart/beginner.ipynb
- docs/site/en/tutorials/quickstart/advanced.ipynb
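Before working through the tutorials, it is worth confirming from a notebook cell that the Jupyter pod was actually scheduled with a GPU (a minimal check, assuming the CUDA-enabled TensorFlow image above):

```python
import tensorflow as tf

# An empty list means the pod landed on a node without a GPU; revisit the
# notebook size ConfigMap applied earlier.
print(tf.config.list_physical_devices("GPU"))
```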
Create a token for the Grafana service account; it will be pasted into the data source definition below.
oc create token grafana-serviceaccount --duration=2000h -n models
Edit grafana-data-source.yaml (replace <namespace> and <service-account-token>).
oc create -f grafana-data-source.yaml
Import the sample [DCGM exporter dashboard](https://grafana.com/grafana/dashboards/12239-nvidia-dcgm-exporter-dashboard/) (grafana/NVIDIA_DCGM_Exporter_Dashboard.json).