InstaSlice works with the NVIDIA GPU Operator to create MIG slices on demand.
Vendors of partitionable accelerators typically require partitions to be created at node boot time; changing them later means evicting every workload on the node before a new set of partitions can be created.
InstaSlice will help if:
- the user does not know a priori all the accelerator partitions needed on every node in the cluster
- the user's partition requirements change at the workload level rather than the node level
- the user does not want to learn or use a new API to request accelerator slices
- the user prefers to use stable device plugin APIs for creating partitions
- Integration with Kubernetes quota management.
- Emulator mode to test the InstaSlice first-fit placement strategy.
- Integration with vLLM, KServe, Deployments, Jobs, and StatefulSets.
- Go v1.22.0+
- Docker v17.03+
- KinD v0.23.0+
- Helm v3.0.0+
- Docker buildx plugin for building cross-platform images.
- kubectl v1.11.3+.
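Before going further, it can help to confirm that the tools listed above are actually on your PATH. The snippet below is a minimal sketch, not part of the InstaSlice repository: the function names `missing_tools` and `report_prerequisites` are made up for illustration, and it only checks that each tool is present, not that it meets the minimum version.

```shell
#!/bin/sh
# Print the name of every tool passed as an argument that is not on PATH.
# (Hypothetical helper, not shipped with InstaSlice.)
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Report which of the prerequisite tools still need installing.
# Version checks are deliberately left out; this only tests presence.
report_prerequisites() {
    missing=$(missing_tools go docker kind helm kubectl)
    if [ -n "$missing" ]; then
        printf 'Missing tools:\n%s\n' "$missing"
        return 1
    fi
    printf 'All prerequisite tools found on PATH.\n'
}
```

Source the snippet and run `report_prerequisites`; it returns a non-zero status if anything is missing. The Docker buildx plugin is not covered here because it is a plugin rather than a standalone binary.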
- Install the NVIDIA GPU drivers and CUDA toolkit on the host.
- Install the NVIDIA Container Toolkit (CTK).
- Configure the NVIDIA Container Runtime as the default Docker runtime:
sudo nvidia-ctk runtime configure --runtime=docker --set-as-default
- Restart Docker to apply the changes:
sudo systemctl restart docker
- Configure the NVIDIA Container Runtime to use volume mounts to select devices to inject into a container:
sudo nvidia-ctk config --set accept-nvidia-visible-devices-as-volume-mounts=true --in-place
This sets accept-nvidia-visible-devices-as-volume-mounts=true in the /etc/nvidia-container-runtime/config.toml file.
- Check if MIG is enabled on the host GPU; look for Enabled in the third row of the table:
nvidia-smi
Sun Aug 18 09:41:46 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.28.03 Driver Version: 560.28.03 CUDA Version: 12.6 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA A100-PCIE-40GB Off | 00000000:07:00.0 Off | On |
| N/A 27C P0 31W / 250W | 1MiB / 40960MiB | N/A Default |
| | | Enabled |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| MIG devices: |
+------------------+----------------------------------+-----------+-----------------------+
| GPU GI CI MIG | Memory-Usage | Vol| Shared |
| ID ID Dev | BAR1-Usage | SM Unc| CE ENC DEC OFA JPG |
| | | ECC| |
|==================+==================================+===========+=======================|
| No MIG devices found |
+-----------------------------------------------------------------------------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+
- If MIG is disabled, enable it by running:
nvidia-smi -i <gpu-id> -mig 1
Example:
nvidia-smi -i 0 -mig 1
Note: You may need to reboot the node for the changes to take effect. An asterisk beside the MIG status (e.g. Enabled*) means the changes are pending and will be applied after a reboot.
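Reading the MIG status out of the full nvidia-smi table is easy to get wrong in scripts. As a sketch, the current and pending MIG modes can instead be queried in CSV form and classified. The query fields `mig.mode.current` and `mig.mode.pending` are listed by `nvidia-smi --help-query-gpu`; this assumes your driver version supports them. The function name `classify_mig_mode` and its messages are illustrative only.

```shell
#!/bin/sh
# Classify lines of the form "<index>, <current>, <pending>" read from stdin,
# as printed by:
#   nvidia-smi --query-gpu=index,mig.mode.current,mig.mode.pending --format=csv,noheader
# (Hypothetical helper, not shipped with InstaSlice.)
classify_mig_mode() {
    while IFS=, read -r index current pending; do
        # Trim the single space that follows each comma in the CSV output.
        current=${current# }
        pending=${pending# }
        if [ "$current" = "Enabled" ] && [ "$pending" = "Enabled" ]; then
            echo "GPU $index: MIG enabled"
        elif [ "$current" != "Enabled" ] && [ "$pending" = "Enabled" ]; then
            echo "GPU $index: MIG enable pending (applied after a reboot)"
        else
            echo "GPU $index: MIG not ready (current: $current, pending: $pending)"
        fi
    done
}
```

On a GPU host, pipe the query into it: `nvidia-smi --query-gpu=index,mig.mode.current,mig.mode.pending --format=csv,noheader | classify_mig_mode`. The "pending" case corresponds to the Enabled* status described in the note above.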
Create a Kind cluster and install the NVIDIA GPU Operator:
bash ./deploy/setup.sh
Note: The validator pods nvidia-cuda-validator-* and nvidia-operator-validator-* of the GPU Operator are expected to fail to initialize. With MIG enabled but no MIG partition created, they effectively have no GPU to run on.
kubectl get pod -n gpu-operator
NAME READY STATUS RESTARTS AGE
gpu-feature-discovery-lzcpv 2/2 Running 0 5m48s
gpu-operator-7b5587d878-vq2gw 1/1 Running 0 6m59s
gpu-operator-node-feature-discovery-gc-8478d46f4c-wvx29 1/1 Running 0 6m59s
gpu-operator-node-feature-discovery-master-688bb86496-cn97b 1/1 Running 0 6m59s
gpu-operator-node-feature-discovery-worker-7twxt 1/1 Running 0 6m52s
nvidia-container-toolkit-daemonset-gpn22 1/1 Running 0 6m13s
nvidia-cuda-validator-sjqgk 0/1 Init:CrashLoopBackOff 5 (111s ago) 4m54s
nvidia-dcgm-exporter-tlcpv 1/1 Running 0 6m7s
nvidia-device-plugin-daemonset-wbbhx 2/2 Running 0 5m53s
nvidia-operator-validator-h7ngh 0/1 Init:2/4 0 6m10s
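Because two pods are expected to be unhealthy, a plain "is everything Running?" check will always fail here. The following sketch filters the pod listing so that only unexpected failures are reported. The function name `unexpected_unhealthy_pods` is made up for this example, and it assumes the listing is produced with `--no-headers` so that the first and third columns are the pod name and status.

```shell
#!/bin/sh
# Read `kubectl get pod -n gpu-operator --no-headers` output on stdin and
# print the names of pods that are neither Running nor Completed, skipping
# the two validator pods that are expected to fail without a MIG partition.
# (Hypothetical helper, not shipped with InstaSlice.)
unexpected_unhealthy_pods() {
    awk '$3 != "Running" && $3 != "Completed" &&
         $1 !~ /^nvidia-(cuda|operator)-validator-/ { print $1 }'
}
```

Usage: `kubectl get pod -n gpu-operator --no-headers | unexpected_unhealthy_pods`. Empty output means the only unhealthy pods, if any, are the expected validators.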
- Optionally, build and push custom, up-to-date controller and daemonset images from source:
IMG=<registry>/<controller-image>:<tag> IMG_DMST=<registry>/<daemonset-image>:<tag> make docker-build docker-push
Example:
IMG=quay.io/example/instaslice2-controller:1.0 IMG_DMST=quay.io/example/instaslice2-daemonset:1.0 make docker-build docker-push
Note: You can use Podman instead of Docker to build images; just set CONTAINER_TOOL=podman before the image-related make targets.
Cross-platform or multi-arch images can be built and pushed using make docker-buildx. When using Docker as your container tool, make sure to create a builder instance. Refer to Multi-platform images for documentation on building multi-platform images with Docker. You can change the destination platform(s) by setting PLATFORMS, e.g.:
PLATFORMS=linux/arm64,linux/amd64 make docker-buildx
- Deploy the controller and daemonset with the default images. All required CRDs will be installed by this command:
make deploy
or with custom-built images:
IMG=<registry>/<controller-image>:<tag> IMG_DMST=<registry>/<daemonset-image>:<tag> make deploy
Example:
IMG=quay.io/example/instaslice2-controller:1.0 IMG_DMST=quay.io/example/instaslice2-daemonset:1.0 make deploy
The all-in-one command for building and deploying InstaSlice:
make docker-build docker-push deploy
Or with custom images:
IMG=<registry>/<controller-image>:<tag> IMG_DMST=<registry>/<daemonset-image>:<tag> make docker-build docker-push deploy
Example:
IMG=quay.io/example/instaslice2-controller:1.0 IMG_DMST=quay.io/example/instaslice2-daemonset:1.0 make docker-build docker-push deploy
- Verify that the InstaSlice pods are successfully running:
kubectl get pod -n instaslice-system
NAME READY STATUS RESTARTS AGE
instaslice-operator-controller-daemonset-5lbqg 1/1 Running 0 101s
instaslice-operator-controller-manager-57b549784c-wkqq2 2/2 Running 0 101s
Note: If you encounter RBAC errors, you may need to grant yourself cluster-admin privileges or be logged in as admin.
- Submit a sample workload:
kubectl apply -f ./samples/test-pod.yaml
pod/cuda-vectoradd-1 created
- Check the status of the workload using the commands
kubectl get pods
NAME READY STATUS RESTARTS AGE
cuda-vectoradd-1 1/1 Running 0 15s
and
kubectl logs cuda-vectoradd-1
GPU 0: NVIDIA A100-PCIE-40GB (UUID: GPU-1785aa6b-6edf-f58e-2e29-f6ccd30f306f)
MIG 1g.5gb Device 0: (UUID: MIG-2cc7f78c-04eb-5a3c-92c7-f423e3572bb8)
[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done
While the pod is running, you can observe the MIG slice created for it automatically:
nvidia-smi
Sun Aug 18 11:48:20 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.28.03 Driver Version: 560.28.03 CUDA Version: 12.6 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA A100-PCIE-40GB Off | 00000000:07:00.0 Off | On |
| N/A 32C P0 63W / 250W | 13MiB / 40960MiB | N/A Default |
| | | Enabled |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| MIG devices: |
+------------------+----------------------------------+-----------+-----------------------+
| GPU GI CI MIG | Memory-Usage | Vol| Shared |
| ID ID Dev | BAR1-Usage | SM Unc| CE ENC DEC OFA JPG |
| | | ECC| |
|==================+==================================+===========+=======================|
| 0 11 0 0 | 13MiB / 4864MiB | 14 0 | 1 0 0 0 0 |
| | 0MiB / 8191MiB | | |
+------------------+----------------------------------+-----------+-----------------------+
...
- Delete the sample pod and see its MIG slice automatically deleted.
kubectl delete -f ./samples/test-pod.yaml
nvidia-smi
Sun Aug 18 13:34:55 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.28.03 Driver Version: 560.28.03 CUDA Version: 12.6 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA A100-PCIE-40GB Off | 00000000:07:00.0 Off | On |
| N/A 32C P0 61W / 250W | 1MiB / 40960MiB | N/A Default |
| | | Enabled |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| MIG devices: |
+------------------+----------------------------------+-----------+-----------------------+
| GPU GI CI MIG | Memory-Usage | Vol| Shared |
| ID ID Dev | BAR1-Usage | SM Unc| CE ENC DEC OFA JPG |
| | | ECC| |
|==================+==================================+===========+=======================|
| No MIG devices found |
+-----------------------------------------------------------------------------------------+
...
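The slice lifecycle shown above (a slice appears while the pod runs and disappears after deletion) can also be checked with a single number instead of the full table. This is a minimal sketch that counts MIG devices in the output of `nvidia-smi -L`, assuming that command prints MIG devices as indented lines beginning with "MIG", in the same form as the pod log earlier. The function name `count_mig_devices` is illustrative.

```shell
#!/bin/sh
# Count MIG devices in `nvidia-smi -L` output read from stdin.
# MIG devices are assumed to appear as indented lines starting with "MIG".
# (Hypothetical helper, not shipped with InstaSlice.)
count_mig_devices() {
    # grep -c exits non-zero when the count is 0, so mask that status.
    grep -c '^[[:space:]]*MIG ' || true
}
```

Run `nvidia-smi -L | count_mig_devices` while the sample pod is running and again after deleting it; with a single sample pod on one GPU, the count should go from 1 back to 0.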
You can apply the samples (examples) from the samples directory:
kubectl apply -k samples/
Note: Make sure the samples use their default values when trying them out.
- Delete all running samples from the cluster:
kubectl delete -k samples/
- Delete the CRDs:
make uninstall
- Undeploy InstaSlice:
make undeploy
- To delete the Kind cluster, just run:
kind delete cluster
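The cleanup steps above can be wrapped in one function if you tear the environment down often. This is only a sketch: `run_step` and `teardown` are hypothetical names, the commands are exactly the four listed above, and the `DRY_RUN` variable is an addition for previewing what would run. Stopping at the first failing step is a design choice of this sketch, not something the project prescribes.

```shell
#!/bin/sh
# Run a command, or with DRY_RUN=1 only print it prefixed by "+ ".
# (Hypothetical helpers, not shipped with InstaSlice.)
run_step() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        printf '+ %s\n' "$*"
    else
        "$@"
    fi
}

# Execute the documented cleanup steps in order, stopping at the first failure.
teardown() {
    run_step kubectl delete -k samples/ &&
    run_step make uninstall &&
    run_step make undeploy &&
    run_step kind delete cluster
}
```

Run `DRY_RUN=1 teardown` from the repository root to preview the commands, then `teardown` to execute them.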
Users (mainly developers) can run the InstaSlice operator in emulator mode as described here. So far this has been tested only on a single-node cluster.
To run the e2e tests locally, run the following command:
make test-e2e
The e2e tests create a local Kind cluster and run against it.
High-level overview of the main priorities for 2024:
- Allocate MIG slices on NVIDIA GPUs on demand
- Configure allocated slices on GPUs and bind containers to slices
- Release and unconfigure slices when pods are completed or deleted
- Graceful termination of slices
- Account for classical node resources when selecting a node
- Schedule pods in an average of 10 seconds when resources are available
- Kubernetes quota system integration
- Konflux onboarding
- Operator SDK integration
Future tasks:
- Stable integration with project Kueue
- Stable integration with provisioning request CRD to support autoscaling
- Handle pods requesting multiple slices
- Manage slices on heterogeneous GPU types in the cluster
- Improved fault tolerance
- Leverage DRA implementation
Copyright 2024.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.