/instaslice-operator

InstaSlice Operator facilitates slicing of accelerators using stable APIs


Note: The KubeCon EU 2024 code (DRA code) is now available in the legacy branch.

InstaSlice

InstaSlice works with the NVIDIA GPU Operator to create MIG slices on demand.

Why InstaSlice

Partitionable accelerators provided by vendors require partitions to be created at node boot time. To change partitions, one has to evict all workloads at the node level before creating a new set of partitions.

InstaSlice will help if:

  • the user does not know a priori all the accelerator partitions needed on every node in the cluster
  • the user's partition requirements change at the workload level rather than the node level
  • the user does not want to learn or use a new API to request accelerator slices
  • the user prefers to use stable device plugin APIs for creating partitions

Features overview

Demo

InstaSlice demo

Getting Started

Prerequisites

Install and configure required NVIDIA software on the host

  1. Install the NVIDIA GPU drivers and CUDA toolkit on the host.

  2. Install the NVIDIA Container Toolkit (CTK).

  3. Configure the NVIDIA Container Runtime as the default Docker runtime:

sudo nvidia-ctk runtime configure --runtime=docker --set-as-default
  4. Restart Docker to apply the changes:
sudo systemctl restart docker
  5. Configure the NVIDIA Container Runtime to use volume mounts to select devices to inject into a container:
sudo nvidia-ctk config --set accept-nvidia-visible-devices-as-volume-mounts=true --in-place

This sets accept-nvidia-visible-devices-as-volume-mounts=true in the /etc/nvidia-container-runtime/config.toml file.
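
To confirm that the setting took effect, the config file can be inspected directly. The following is a minimal sketch, not part of InstaSlice: the function name volume_mounts_enabled is hypothetical, and the check assumes the key appears as an uncommented "key = true" line in the file.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not shipped with InstaSlice): succeeds when the
# volume-mount setting is enabled in the given config.toml.
volume_mounts_enabled() {
  local config_file="${1:-/etc/nvidia-container-runtime/config.toml}"
  # Match an uncommented "key = true" line, tolerating spaces around "=".
  grep -Eq '^[[:space:]]*accept-nvidia-visible-devices-as-volume-mounts[[:space:]]*=[[:space:]]*true' "$config_file"
}

# Usage on the host:
#   volume_mounts_enabled && echo "volume-mount device selection is enabled"
```

The function only reads the file, so it is safe to run repeatedly.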

Enable MIG on the GPU

  • Check if MIG is enabled on the host GPU - look for Enabled in the third row of the table:
nvidia-smi
Sun Aug 18 09:41:46 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.28.03              Driver Version: 560.28.03      CUDA Version: 12.6     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A100-PCIE-40GB          Off |   00000000:07:00.0 Off |                   On |
| N/A   27C    P0             31W /  250W |       1MiB /  40960MiB |     N/A      Default |
|                                         |                        |              Enabled |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| MIG devices:                                                                            |
+------------------+----------------------------------+-----------+-----------------------+
| GPU  GI  CI  MIG |                     Memory-Usage |        Vol|        Shared         |
|      ID  ID  Dev |                       BAR1-Usage | SM     Unc| CE ENC  DEC  OFA  JPG |
|                  |                                  |        ECC|                       |
|==================+==================================+===========+=======================|
|  No MIG devices found                                                                   |
+-----------------------------------------------------------------------------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
  • If MIG is disabled, enable it by running:
nvidia-smi -i <gpu-id> -mig 1

Example:

nvidia-smi -i 0 -mig 1

Note: You may need to reboot the node for the changes to take effect. An asterisk beside MIG status (e.g. Enabled*) means the changes are pending and will be applied after a reboot.
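
The current and pending MIG modes can also be read in a script-friendly form instead of scanning the table. The sketch below assumes the nvidia-smi query fields mig.mode.current and mig.mode.pending are available with your driver; the function name mig_status is hypothetical and only formats the CSV that nvidia-smi prints.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads "index, current, pending" CSV lines on stdin
# and prints one status line per GPU.
mig_status() {
  local index current pending
  while IFS=, read -r index current pending; do
    # Strip the spaces that follow each comma in the CSV output.
    index="${index// /}"
    current="${current// /}"
    pending="${pending// /}"
    if [ -z "$index" ]; then
      continue
    fi
    if [ "$current" = "Enabled" ]; then
      echo "GPU $index: MIG enabled"
    elif [ "$pending" = "Enabled" ]; then
      echo "GPU $index: MIG pending, a reboot may be required"
    else
      echo "GPU $index: MIG disabled"
    fi
  done
}

# Usage on a GPU host:
#   nvidia-smi --query-gpu=index,mig.mode.current,mig.mode.pending --format=csv,noheader | mig_status
```

A pending state here corresponds to the asterisk shown beside the MIG status in the table.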

Install a Kind cluster with the GPU Operator

Create a Kind cluster and install the NVIDIA GPU Operator:

bash ./deploy/setup.sh

Note: The GPU Operator validator pods nvidia-cuda-validator-* and nvidia-operator-validator-* are expected to fail to initialize. With MIG enabled but no MIG partition created, they effectively have no GPU to run on.

kubectl get pod -n gpu-operator
NAME                                                          READY   STATUS                  RESTARTS       AGE
gpu-feature-discovery-lzcpv                                   2/2     Running                 0              5m48s
gpu-operator-7b5587d878-vq2gw                                 1/1     Running                 0              6m59s
gpu-operator-node-feature-discovery-gc-8478d46f4c-wvx29       1/1     Running                 0              6m59s
gpu-operator-node-feature-discovery-master-688bb86496-cn97b   1/1     Running                 0              6m59s
gpu-operator-node-feature-discovery-worker-7twxt              1/1     Running                 0              6m52s
nvidia-container-toolkit-daemonset-gpn22                      1/1     Running                 0              6m13s
nvidia-cuda-validator-sjqgk                                   0/1     Init:CrashLoopBackOff   5 (111s ago)   4m54s
nvidia-dcgm-exporter-tlcpv                                    1/1     Running                 0              6m7s
nvidia-device-plugin-daemonset-wbbhx                          2/2     Running                 0              5m53s
nvidia-operator-validator-h7ngh                               0/1     Init:2/4                0              6m10s
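
Because the validator pods are expected to fail, it can help to filter them out when checking the namespace for real problems. The following sketch assumes the default column layout of kubectl get pod (NAME, READY, STATUS, ...); the function name unexpected_unhealthy_pods is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads `kubectl get pod` output on stdin and prints the
# names of pods that are neither Running nor Completed, skipping the two
# validator pods that are expected to fail before any MIG slice exists.
unexpected_unhealthy_pods() {
  awk 'NR > 1 &&
       $3 != "Running" && $3 != "Completed" &&
       $1 !~ /^nvidia-(cuda|operator)-validator-/ { print $1 }'
}

# Usage:
#   kubectl get pod -n gpu-operator | unexpected_unhealthy_pods
```

Empty output means that every pod other than the validators is in a healthy state.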

Deploy InstaSlice

  1. Optionally, build and push custom, up-to-date controller and daemonset images from source:
IMG=<registry>/<controller-image>:<tag> IMG_DMST=<registry>/<daemonset-image>:<tag> make docker-build docker-push

Example:

IMG=quay.io/example/instaslice2-controller:1.0 IMG_DMST=quay.io/example/instaslice2-daemonset:1.0 make docker-build docker-push

Note: You can use Podman instead of Docker to build images. To do so, set CONTAINER_TOOL=podman before the image-related make targets.

Cross-platform or multi-arch images can be built and pushed using make docker-buildx. When using Docker as your container tool, make sure to create a builder instance. Refer to Multi-platform images for documentation on building multi-platform images with Docker. You can change the destination platform(s) by setting PLATFORMS, e.g.:

PLATFORMS=linux/arm64,linux/amd64 make docker-buildx
  2. Deploy the controller and daemonset with the default images. All required CRDs will be installed by this command:
make deploy

Or with custom-built images:

IMG=<registry>/<controller-image>:<tag> IMG_DMST=<registry>/<daemonset-image>:<tag> make deploy

Example:

IMG=quay.io/example/instaslice2-controller:1.0 IMG_DMST=quay.io/example/instaslice2-daemonset:1.0 make deploy

The all-in-one command for building and deploying InstaSlice:

make docker-build docker-push deploy

Or with custom images:

IMG=<registry>/<controller-image>:<tag> IMG_DMST=<registry>/<daemonset-image>:<tag> make docker-build docker-push deploy

Example:

IMG=quay.io/example/instaslice2-controller:1.0 IMG_DMST=quay.io/example/instaslice2-daemonset:1.0 make docker-build docker-push deploy
  3. Verify that the InstaSlice pods are successfully running:
kubectl get pod -n instaslice-system
NAME                                                      READY   STATUS    RESTARTS   AGE
instaslice-operator-controller-daemonset-5lbqg            1/1     Running   0          101s
instaslice-operator-controller-manager-57b549784c-wkqq2   2/2     Running   0          101s

Note: If you encounter RBAC errors, you may need to grant yourself cluster-admin privileges or be logged in as admin.

Run a sample workload

  1. Submit a sample workload:
kubectl apply -f ./samples/test-pod.yaml
pod/cuda-vectoradd-1 created
  2. Check the status of the workload using the following commands:
kubectl get pods
NAME               READY   STATUS    RESTARTS   AGE
cuda-vectoradd-1   1/1     Running   0          15s

and

kubectl logs cuda-vectoradd-1
GPU 0: NVIDIA A100-PCIE-40GB (UUID: GPU-1785aa6b-6edf-f58e-2e29-f6ccd30f306f)
  MIG 1g.5gb      Device  0: (UUID: MIG-2cc7f78c-04eb-5a3c-92c7-f423e3572bb8)
[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done

While the pod is running, you can observe the MIG slice created for it automatically:

nvidia-smi
Sun Aug 18 11:48:20 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.28.03              Driver Version: 560.28.03      CUDA Version: 12.6     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A100-PCIE-40GB          Off |   00000000:07:00.0 Off |                   On |
| N/A   32C    P0             63W /  250W |      13MiB /  40960MiB |     N/A      Default |
|                                         |                        |              Enabled |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| MIG devices:                                                                            |
+------------------+----------------------------------+-----------+-----------------------+
| GPU  GI  CI  MIG |                     Memory-Usage |        Vol|        Shared         |
|      ID  ID  Dev |                       BAR1-Usage | SM     Unc| CE ENC  DEC  OFA  JPG |
|                  |                                  |        ECC|                       |
|==================+==================================+===========+=======================|
|  0   11   0   0  |              13MiB /  4864MiB    | 14      0 |  1   0    0    0    0 |
|                  |                 0MiB /  8191MiB  |           |                       |
+------------------+----------------------------------+-----------+-----------------------+
...
  3. Delete the sample pod and observe that its MIG slice is deleted automatically:
kubectl delete -f ./samples/test-pod.yaml
nvidia-smi
Sun Aug 18 13:34:55 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.28.03              Driver Version: 560.28.03      CUDA Version: 12.6     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A100-PCIE-40GB          Off |   00000000:07:00.0 Off |                   On |
| N/A   32C    P0             61W /  250W |       1MiB /  40960MiB |     N/A      Default |
|                                         |                        |              Enabled |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| MIG devices:                                                                            |
+------------------+----------------------------------+-----------+-----------------------+
| GPU  GI  CI  MIG |                     Memory-Usage |        Vol|        Shared         |
|      ID  ID  Dev |                       BAR1-Usage | SM     Unc| CE ENC  DEC  OFA  JPG |
|                  |                                  |        ECC|                       |
|==================+==================================+===========+=======================|
|  No MIG devices found                                                                   |
+-----------------------------------------------------------------------------------------+
...
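
Slice creation and deletion can also be checked without reading the full table by counting MIG devices in the output of nvidia-smi -L, which lists devices in the same form as the first lines of the sample pod's log. This is a sketch under that assumption; the function name count_mig_devices is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads `nvidia-smi -L` output on stdin and prints the
# number of MIG devices listed (indented lines such as "  MIG 1g.5gb  Device  0: ...").
count_mig_devices() {
  # grep -c prints the match count but exits non-zero when the count is 0;
  # "|| true" keeps the function from failing in that case.
  grep -cE '^[[:space:]]+MIG [^[:space:]]+[[:space:]]+Device' || true
}

# Usage on the GPU host:
#   nvidia-smi -L | count_mig_devices
```

While the sample pod is running the count should be 1 on a single-GPU host with no other slices, and it should drop back to 0 after the pod is deleted.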

Create instances of your solution

You can apply the samples (examples) from the samples directory:

kubectl apply -k samples/

Note: Make sure the samples use the default values when you test them.

Uninstall

  1. Delete all running samples from the cluster:
kubectl delete -k samples/
  2. Delete the CRDs:
make uninstall
  3. Undeploy InstaSlice:
make undeploy
  4. To delete the Kind cluster, run:
kind delete cluster

Run InstaSlice in simulator mode

Users (mainly developers) can run the InstaSlice operator in emulator mode, as described here. So far, this has been tested only on a single-node cluster.

Running e2e tests

To run the e2e tests locally, run the following command:

make test-e2e

The e2e tests run against a Kind cluster that is created locally.

Roadmap

High-level overview of the main priorities for 2024:

  • Allocate MIG slices on NVIDIA GPUs on demand
  • Configure allocated slices on GPUs and bind containers to slices
  • Release and unconfigure slices when pods are completed or deleted
  • Graceful termination of slices
  • Account for classical node resources when selecting a node
  • Schedule pods within an average of 10 seconds when resources are available
  • Kubernetes quota system integration
  • Konflux onboarding
  • Operator SDK integration

Future tasks:

  • Stable integration with project Kueue
  • Stable integration with provisioning request CRD to support autoscaling
  • Handle pods requesting multiple slices
  • Manage slices on heterogeneous GPU types in the cluster
  • Improved fault tolerance
  • Leverage DRA implementation

License

Copyright 2024.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.