This DRA resource driver is currently under active development and is not yet intended for production use.
We will continue to force-push over `main` until we have something more stable.
Use at your own risk.
A document and demo of the DRA support for GPUs provided by this repo are available.
This section describes using `kind` to demo the functionality of the NVIDIA GPU DRA Driver.
First, since we'll launch `kind` with GPU support, ensure that the following prerequisites are met:

- `kind` is installed. See the official documentation here.
- Ensure that the NVIDIA Container Toolkit is installed on your system. This can be done by following the instructions here.
- Configure the NVIDIA Container Runtime as the default Docker runtime:
  ```bash
  sudo nvidia-ctk runtime configure --runtime=docker --set-as-default
  ```
- Restart Docker to apply the changes:
  ```bash
  sudo systemctl restart docker
  ```
- Set the `accept-nvidia-visible-devices-as-volume-mounts` option to `true` in the `/etc/nvidia-container-runtime/config.toml` file to configure the NVIDIA Container Runtime to use volume mounts to select devices to inject into a container (see the sketch after this list).
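The last prerequisite has no single toolkit command shown above. A minimal sketch of one way to do it with `sed` is below, assuming the default config location and that the key already appears (possibly commented out) in the file; if it does not, add the line by hand:

```bash
# Back up the runtime config before editing it.
sudo cp /etc/nvidia-container-runtime/config.toml{,.bak}

# Uncomment (or overwrite) the option and set it to true.
sudo sed -i \
  's/^#\?\s*accept-nvidia-visible-devices-as-volume-mounts\s*=.*/accept-nvidia-visible-devices-as-volume-mounts = true/' \
  /etc/nvidia-container-runtime/config.toml

# Verify the resulting value.
grep accept-nvidia-visible-devices-as-volume-mounts /etc/nvidia-container-runtime/config.toml
```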
We start by cloning this repository and `cd`ing into it.
All of the scripts and example Pod specs used in this demo are in the `demo`
subdirectory, so take a moment to browse through the various files and see
what's available:
```bash
git clone https://github.com/NVIDIA/k8s-dra-driver.git
cd k8s-dra-driver
```
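For example, a quick way to see the layout (directory names taken from the paths used later in this demo):

```bash
# Show the demo scripts and the example Pod specs referenced below.
ls demo/clusters/kind
ls demo/specs/quickstart
```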
First, create a `kind` cluster to run the demo:

```bash
./demo/clusters/kind/create-cluster.sh
```
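If you want to confirm the cluster came up before proceeding, the standard `kind` and `kubectl` checks below should suffice (the cluster name is chosen by the script, so it is not hard-coded here):

```bash
# List the kind clusters on this machine and check that the nodes are Ready.
kind get clusters
kubectl get nodes
```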
Next, build the image for the example resource driver:

```bash
./demo/clusters/kind/build-dra-driver.sh
```

This also makes the built images available to the `kind` cluster.
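For reference, images are typically made available to a `kind` cluster by loading them from the local Docker daemon. A minimal sketch of that mechanism is below; the image name is a placeholder for illustration, not the driver's actual tag (the script above handles the real names for you):

```bash
# Load a locally built image into the nodes of the default kind cluster.
kind load docker-image example.com/k8s-dra-driver:dev
```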
We now install the NVIDIA GPU DRA driver:

```bash
./demo/clusters/kind/install-dra-driver.sh
```
This should show two pods running in the `nvidia-dra-driver` namespace:

```console
$ kubectl get pods -n nvidia-dra-driver
NAME                                     READY   STATUS    RESTARTS   AGE
nvidia-dra-controller-6bdf8f88cc-psb4r   1/1     Running   0          34s
nvidia-dra-plugin-lt7qh                  1/1     Running   0          32s
```
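If you would rather block until both pods are Ready than poll by hand, standard `kubectl wait` can do this (the timeout value here is an arbitrary choice):

```bash
# Wait for all driver pods in the namespace to become Ready.
kubectl wait --for=condition=Ready pods --all \
  -n nvidia-dra-driver --timeout=120s
```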
Finally, you can run the various examples contained in the `demo/specs/quickstart`
folder.
The `README` in that directory shows the full script of the demo you can walk through:

```console
$ cat demo/specs/quickstart/README.md
...
```
Running the first three examples should produce output similar to the following:

```console
$ kubectl apply --filename=demo/specs/quickstart/gpu-test{1,2,3}.yaml
...

$ kubectl get pod -A
NAMESPACE   NAME   READY   STATUS    RESTARTS   AGE
gpu-test1   pod1   1/1     Running   0          34s
gpu-test1   pod2   1/1     Running   0          34s
gpu-test2   pod    2/2     Running   0          34s
gpu-test3   pod1   1/1     Running   0          34s
gpu-test3   pod2   1/1     Running   0          34s
...
```
```console
$ kubectl logs -n gpu-test1 -l app=pod
GPU 0: A100-SXM4-40GB (UUID: GPU-662077db-fa3f-0d8f-9502-21ab0ef058a2)
GPU 0: A100-SXM4-40GB (UUID: GPU-4cf8db2d-06c0-7d70-1a51-e59b25b2c16c)

$ kubectl logs -n gpu-test2 pod --all-containers
GPU 0: A100-SXM4-40GB (UUID: GPU-79a2ba02-a537-ccbf-2965-8e9d90c0bd54)
GPU 0: A100-SXM4-40GB (UUID: GPU-79a2ba02-a537-ccbf-2965-8e9d90c0bd54)

$ kubectl logs -n gpu-test3 -l app=pod
GPU 0: A100-SXM4-40GB (UUID: GPU-4404041a-04cf-1ccf-9e70-f139a9b1e23c)
GPU 0: A100-SXM4-40GB (UUID: GPU-4404041a-04cf-1ccf-9e70-f139a9b1e23c)
```
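When you are done with these examples, the same spec files can be used to tear them down. This is standard `kubectl` usage, assuming the `gpu-test*` namespaces are declared in those files so they are removed along with the pods:

```bash
# Delete the example workloads (and their namespaces, if defined in the files).
kubectl delete --filename=demo/specs/quickstart/gpu-test{1,2,3}.yaml
```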
Running the following will remove the cluster created in the preceding steps:

```bash
./demo/clusters/kind/delete-cluster.sh
```