
Red Hat GPU Workshop

1. Overview

A workshop focused on using NVIDIA GPUs on Red Hat platforms.

2. Background

Graphics Processing Units (GPUs) were originally invented to let application developers program 3D graphics accelerators to render photorealistic images in real time. The key is that GPUs accelerate matrix and vector math operations (dot products, cross products, and matrix multiplies). It turns out these operations are used in many applications besides 3D graphics, including high performance computing and machine learning. As a result, software libraries (e.g. CUDA) were developed to allow non-graphics, general purpose computing applications to take advantage of GPU hardware.

Figure 1. Fixed graphics pipeline
Figure 2. Programmable graphics pipeline

GPUs provide a programmable graphics pipeline for realistic image rendering.

3. RHEL 9.1

The following NVIDIA® software is required for GPU support.

  • NVIDIA® GPU drivers version 450.80.02 or higher.

  • CUDA® Toolkit 11.2.

  • cuDNN SDK 8.1.0.

  • (Optional) TensorRT to improve latency and throughput for inference.

The easy way to install the NVIDIA stack is to use the Ansible-based provisioning.

Once the stack is built, running the nvidia-smi utility will cause the kernel modules to load and will verify the driver installation.

nvidia-smi

Example output:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.47.03    Driver Version: 510.47.03    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            Off  | 00000000:00:1E.0 Off |                    0 |
| N/A   37C    P0    26W /  70W |      0MiB / 15360MiB |     10%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
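As additional checks, confirm the NVIDIA kernel modules loaded and the CUDA toolkit version (the nvcc check assumes /usr/local/cuda/bin is on your PATH):

lsmod | grep nvidia
nvcc --version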

3.1. RHEL Testing

Non-container app test

Now the system should be ready to run a GPU workload. Run a simple test to validate the software stack.

Create a Python virtual environment and install TensorFlow.

python3 -m venv venv
source venv/bin/activate
pip install pip tensorflow -U

Run a simple TensorFlow test to confirm a GPU device is found.

python

>>> import tensorflow as tf

>>> tf.config.list_logical_devices()

Sample output:

Created device /job:localhost/replica:0/task:0/device:GPU:0 with 14644 MB
memory:  -> device: 0, name: Tesla T4, pci bus id: 0001:00:00.0, compute capability: 7.5

[LogicalDevice(name='/device:CPU:0', device_type='CPU'),
 LogicalDevice(name='/device:GPU:0', device_type='GPU')]

Run the script to test the TensorFlow devices.

python src/tf-test.py

Compare the CPU vs. GPU elapsed time in the output.

[PhysicalDevice(name='/physical_device:CPU:0', device_type='CPU'), PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
Matrix Multiply Elapsed Time: {'CPU': 6.495161056518555, 'GPU': 0.9890825748443604}
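For reference, here is a minimal sketch of what such a timing script can look like; the actual src/tf-test.py in the repo may differ.

import time
import tensorflow as tf

# List the devices, then time the same matrix multiply on each one.
print(tf.config.list_physical_devices())

def time_matmul(device, n=4000):
    with tf.device(device):
        a = tf.random.uniform((n, n))
        b = tf.random.uniform((n, n))
        tf.matmul(a, b)  # warm-up; the first GPU op pays initialization cost
        start = time.time()
        tf.matmul(a, b).numpy()  # .numpy() blocks until the result is ready
        return time.time() - start

print('Matrix Multiply Elapsed Time:',
      {'CPU': time_matmul('/CPU:0'), 'GPU': time_matmul('/GPU:0')})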

4. Additional RHEL Demos

4.1. CUDA Demo

The CUDA toolkit ships a demo suite that includes an n-body simulation; run the benchmark on the CPU, then on the GPU, and compare the reported performance.

cd /usr/local/cuda-11.8/extras/demo_suite
./nbody -benchmark -cpu
./nbody -benchmark

4.2. Container Demos

4.2.1. For RHEL 8.x, install the NVIDIA container toolkit.

4.2.2. For RHEL 9.x, follow this blog post to install the NVIDIA container toolkit.
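In either case, the installation generally amounts to enabling NVIDIA's package repository and installing the toolkit, roughly as follows (verify the repository URL against NVIDIA's current install guide):

curl -s -L https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo | \
  sudo tee /etc/yum.repos.d/nvidia-container-toolkit.repo
sudo dnf install -y nvidia-container-toolkit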

4.3. Containerized app test

The nvidia-smi output should be similar to what was reported above. The --security-opt=label=disable option relaxes SELinux labeling so the container can access the GPU devices, and --hooks-dir points podman at the NVIDIA OCI hook installed by the container toolkit.

podman run --rm --security-opt=label=disable --hooks-dir=/usr/share/containers/oci/hooks.d/ nvcr.io/nvidia/cuda:11.3.0-devel-ubi8 nvidia-smi

5. OpenShift

The easy way to install the NVIDIA stack is to use the Ansible-based provisioning.

Wait for all the pods to reach a Completed or Running status. This could take several minutes.

oc get pods -n nvidia-gpu-operator
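To watch status changes as they happen, add the --watch flag:

oc get pods -n nvidia-gpu-operator --watch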

The daemonset pods will build a driver for each node with a GPU.

oc logs nvidia-driver-daemonset-410.84.202204112301-0-gf4t4  -n nvidia-gpu-operator  nvidia-driver-ctr --follow

Tue May 17 19:41:23 UTC 2022 Waiting for openshift-driver-toolkit-ctr container to build the precompiled driver ...

Check the logs from one of the nvidia-cuda-validator pods.

oc logs -n nvidia-gpu-operator nvidia-cuda-validator-qpqcg


cuda workload validation is successful

5.1. OpenShift Application Testing

Create a project as a cluster-admin user, then create a Jupyter/TensorFlow application and expose its service as an OpenShift route.

APP_NAME=tensorflow
oc new-project gputest
oc new-app --name=${APP_NAME} docker.io/tensorflow/tensorflow:latest-gpu-jupyter
oc expose service/tensorflow
ROUTE=http://$(oc get route ${APP_NAME} --template={{.spec.host}})
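Print the route so it can be opened in a browser later:

echo ${ROUTE}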

Dump the logs of the tensorflow pod to obtain the jupyter lab token.

POD=$(oc get pods --selector=deployment=tensorflow --output=custom-columns=:.metadata.name --no-headers)

oc logs ${POD} | grep token
http://tensorflow-544f7d6b5b-m8sjg:8888/?token=7f5cfa6a9780fd77594c1e6a45ae88002169e98d87a38580

It may be necessary to set the nvidia.com/gpu=1 limit to ensure the pod gets scheduled on a node with a GPU.

oc set resources deployment/${APP_NAME} --requests=nvidia.com/gpu=1 --limits=nvidia.com/gpu=1
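This produces a resources stanza along these lines in the deployment's pod template:

resources:
  limits:
    nvidia.com/gpu: "1"
  requests:
    nvidia.com/gpu: "1"

Note that updating the deployment rolls out a new pod, so re-run the oc get pods command above to refresh ${POD} before connecting.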

Connect to the TensorFlow pod and run a quick GPU test.

oc rsh ${POD} bash

tf-docker /tf > python
Python 3.8.10 (default, Mar 15 2022, 12:22:08)
[GCC 9.4.0] on linux
>>> import tensorflow as tf
>>> tf.config.list_physical_devices()
[PhysicalDevice(name='/physical_device:CPU:0', device_type='CPU'), PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
>>> exit()
$

5.2. Jupyter/TensorFlow Example
  • Visit the ${ROUTE} from above.

  • Use the token to login to Jupyter.

  • Open the tensorflow-tutorials/classification.ipynb notebook.

  • Run all of the cells.

  • It should train, test and validate a machine learning model.

6. GPU Dashboard (OpenShift v4.11+)

Install the GPU console plugin dashboard by following the OpenShift documentation.

7. OpenDataHub (optional)

Create a new project for OpenDataHub.

Using the OpenShift web console, create an instance of the ODH operator in this project.

Create an ODH instance in your namespace.

Create the CUDA-enabled notebook image streams.

oc apply -f https://raw.githubusercontent.com/red-hat-data-services/odh-manifests/master/jupyterhub/notebook-images/overlays/additional/tensorflow-notebook-imagestream.yaml

7.1. Custom Notebook Limits (Optional)

Configmaps are used to set custom notebook resource limits such as the number of CPU cores, memory, and GPUs. This is necessary for the Jupyter pod to get scheduled on a GPU node.

Apply the following configmap before launching the JupyterHub server.

oc apply -f src/jupyterhub-notebook-sizes.yml
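As a rough illustration only (the actual file ships in the repo, and the exact schema follows ODH's singleuser profiles convention), a GPU-enabled notebook size could look something like this:

apiVersion: v1
kind: ConfigMap
metadata:
  name: jupyterhub-singleuser-profiles   # illustrative name
data:
  profiles: |
    - name: gpu-small
      resources:
        requests:
          cpu: "2"
          memory: 8Gi
        limits:
          nvidia.com/gpu: "1"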

From within Jupyter, clone the following repo:

These TensorFlow notebook examples should run:

  • docs/site/en/tutorials/keras/classification.ipynb

  • docs/site/en/tutorials/quickstart/beginner.ipynb

  • docs/site/en/tutorials/quickstart/advanced.ipynb

8. DIY Grafana GPU Dashboard

Create a long-lived API token for the Grafana service account (the models namespace here is where Grafana is deployed):

oc create token grafana-serviceaccount --duration=2000h -n models

Edit grafana-data-source.yaml (replace <namespace> and <service-account-token>)

oc create -f grafana-data-source.yaml
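For reference, a data source pointing Grafana at the OpenShift monitoring Prometheus typically looks something like the following sketch (field values here are illustrative; the repo's grafana-data-source.yaml may differ):

apiVersion: integreatly.org/v1alpha1
kind: GrafanaDataSource
metadata:
  name: prometheus-gpu
  namespace: <namespace>
spec:
  name: prometheus-gpu.yaml
  datasources:
    - name: Prometheus
      type: prometheus
      access: proxy
      url: https://thanos-querier.openshift-monitoring.svc:9091
      jsonData:
        httpHeaderName1: Authorization
        tlsSkipVerify: true
      secureJsonData:
        httpHeaderValue1: Bearer <service-account-token>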

Import the sample DCGM exporter dashboard (https://grafana.com/grafana/dashboards/12239-nvidia-dcgm-exporter-dashboard/), also included in the repo as grafana/NVIDIA_DCGM_Exporter_Dashboard.json.
