English version|中文版
4paradigm k8s vGPU scheduler is an "all in one" chart to manage your GPU in k8s cluster, it has everything you expect for a k8s GPU manager, including:
GPU sharing: Each task can allocate a portion of GPU instead of a whole GPU card, thus GPU can be shared among multiple tasks.
Device Memory Control: GPUs can be allocated with certain device memory size (i.e 3000M) or device memory percentage of whole GPU(i.e 50%) and have made it that it does not exceed the boundary.
Virtual Device memory: You can oversubscribe GPU device memory by using host memory as its swap.
GPU Type Specification: You can specify which type of GPU to use or to avoid for a certain GPU task, by setting "nvidia.com/use-gputype" or "nvidia.com/nouse-gputype" annotations.
Easy to use: You don't need to modify your task yaml to use our scheduler. All your GPU jobs will be automatically supported after installation. In addition, you can specify your resource name other than "nvidia.com/gpu" if you wish
The k8s vGPU scheduler is based on retaining features of 4paradigm k8s-device-plugin (4paradigm/k8s-device-plugin), such as splitting the physical GPU, limiting the memory, and computing unit. It adds the scheduling module to balance the GPU usage across GPU nodes. In addition, it allows users to allocate GPU by specifying the device memory and device core usage. Furthermore, the vGPU scheduler can virtualize the device memory (the used device memory can exceed the physical device memory), run some tasks with large device memory requirements, or increase the number of shared tasks. You can refer to the benchmarks report.
- Scenarios when pods need to be allocated with certain device memory usage or device cores.
- Needs to balance GPU usage in cluster with mutiple GPU node
- Low utilization of device memory and computing units, such as running 10 tf-servings on one GPU.
- Situations that require a large number of small GPUs, such as teaching scenarios where one GPU is provided for multiple students to use, and the cloud platform that provides small GPU instance.
- In the case of insufficient physical device memory, virtual device memory can be turned on, such as training of large batches and large models.
The list of prerequisites for running the NVIDIA device plugin is described below:
- NVIDIA drivers >= 384.81
- nvidia-docker version > 2.0
- Kubernetes version >= 1.16
- glibc >= 2.17
- kernel version >= 3.10
- helm > 3.0
The following steps need to be executed on all your GPU nodes.
This README assumes that the NVIDIA drivers and the nvidia-container-toolkit
have been pre-installed.
It also assumes that you have configured the nvidia-container-runtime
as the default low-level runtime to use.
Please see: https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/libnvidia-container/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.list | sudo tee /etc/apt/sources.list.d/libnvidia-container.list
sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
When running kubernetes
with docker
, edit the config file which is usually
present at /etc/docker/daemon.json
to set up nvidia-container-runtime
as
the default low-level runtime:
{
"default-runtime": "nvidia",
"runtimes": {
"nvidia": {
"path": "/usr/bin/nvidia-container-runtime",
"runtimeArgs": []
}
}
}
And then restart docker
:
$ sudo systemctl daemon-reload && systemctl restart docker
When running kubernetes
with containerd
, edit the config file which is
usually present at /etc/containerd/config.toml
to set up
nvidia-container-runtime
as the default low-level runtime:
version = 2
[plugins]
[plugins."io.containerd.grpc.v1.cri"]
[plugins."io.containerd.grpc.v1.cri".containerd]
default_runtime_name = "nvidia"
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes]
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
privileged_without_host_devices = false
runtime_engine = ""
runtime_root = ""
runtime_type = "io.containerd.runc.v2"
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
BinaryName = "/usr/bin/nvidia-container-runtime"
And then restart containerd
:
$ sudo systemctl daemon-reload && systemctl restart containerd
Then, you need to label your GPU nodes which can be scheduled by 4pd-k8s-scheduler by adding "gpu=on", otherwise, it cannot be managed by our scheduler.
kubectl label nodes {nodeid} gpu=on
First, you need to heck your Kubernetes version by the using the following command
kubectl version
Then, add our repo in helm
helm repo add vgpu-charts https://4paradigm.github.io/k8s-vgpu-scheduler
You need to set the Kubernetes scheduler image version according to your Kubernetes server version during installation. For example, if your cluster server version is 1.16.8, then you should use the following command for deployment
helm install vgpu vgpu-charts/vgpu --set scheduler.kubeScheduler.imageTag=v1.16.8 -n kube-system
You can customize your installation by adjusting configs.
You can verify your installation by the following command:
$ kubectl get pods -n kube-system
If the following two pods vgpu-device-plugin
and vgpu-scheduler
are in Running state, then your installation is successful.
NVIDIA vGPUs can now be requested by a container
using the nvidia.com/gpu
resource type:
apiVersion: v1
kind: Pod
metadata:
name: gpu-pod
spec:
containers:
- name: ubuntu-container
image: ubuntu:18.04
command: ["bash", "-c", "sleep 86400"]
resources:
limits:
nvidia.com/gpu: 2 # requesting 2 vGPUs
nvidia.com/gpumem: 3000 # Each vGPU contains 3000m device memory (Optional,Integer)
nvidia.com/gpucores: 30 # Each vGPU uses 30% of the entire GPU (Optional,Integer)
You should be cautious that if the task can't fit in any GPU node(ie. the number of nvidia.com/gpu
you request exceeds the number of GPU in any node). The task will get stuck in pending
state.
You can now execute nvidia-smi
command in the container and see the difference of GPU memory between vGPU and real GPU.
WARNING: if you don't request vGPUs when using the device plugin with NVIDIA images all the vGPUs on the machine will be exposed inside your container.
Click here
Default schedulerPort is 31998, other values can be set using --set deivcePlugin.service.schedulerPort
during installation.
Monitoring is automatically enabled after installation. You can get vGPU status of a node by visiting
http://{nodeip}:{monitorPort}/metrics
Default monitorPort is 31992, other values can be set using --set deivcePlugin.service.httpPort
during installation.
grafana dashboard example
Note The status of a node won't be collected before any GPU operations
To Upgrade the k8s-vGPU to the latest version, all you need to do is update the repo and restart the chart.
$ helm uninstall vgpu -n kube-system
$ helm repo update
$ helm install vgpu vgpu -n kube-system
helm uninstall vgpu -n kube-system
Current schedule strategy is to select GPU with the lowest task. Thus balance the loads across mutiple GPUs
Three instances from ai-benchmark have been used to evaluate vGPU-device-plugin performance as follows
Test Environment | description |
---|---|
Kubernetes version | v1.12.9 |
Docker version | 18.09.1 |
GPU Type | Tesla V100 |
GPU Num | 2 |
Test instance | description |
---|---|
nvidia-device-plugin | k8s + nvidia k8s-device-plugin |
vGPU-device-plugin | k8s + VGPU k8s-device-plugin,without virtual device memory |
vGPU-device-plugin(virtual device memory) | k8s + VGPU k8s-device-plugin,with virtual device memory |
Test Cases:
test id | case | type | params |
---|---|---|---|
1.1 | Resnet-V2-50 | inference | batch=50,size=346*346 |
1.2 | Resnet-V2-50 | training | batch=20,size=346*346 |
2.1 | Resnet-V2-152 | inference | batch=10,size=256*256 |
2.2 | Resnet-V2-152 | training | batch=10,size=256*256 |
3.1 | VGG-16 | inference | batch=20,size=224*224 |
3.2 | VGG-16 | training | batch=2,size=224*224 |
4.1 | DeepLab | inference | batch=2,size=512*512 |
4.2 | DeepLab | training | batch=1,size=384*384 |
5.1 | LSTM | inference | batch=100,size=1024*300 |
5.2 | LSTM | training | batch=10,size=1024*300 |
To reproduce:
- install k8s-vGPU-scheduler,and configure properly
- run benchmark job
$ kubectl apply -f benchmarks/ai-benchmark/ai-benchmark.yml
- View the result by using kubctl logs
$ kubectl logs [pod id]
- Specify the number of vGPUs divided by each physical GPU.
- Limits vGPU's Device Memory.
- Allows vGPU allocation by specifying device memory
- Limits vGPU's Streaming Multiprocessor.
- Allows vGPU allocation by specifying device core usage
- Zero changes to existing programs
-
Virtual Device Memory
The device memory of the vGPU can exceed the physical device memory of the GPU. At this time, the excess part will be put in the RAM, which will have a certain impact on the performance.
- Currently, A100 MIG can only support "none" and "mixed" mode
- Currently, task with filed "nodeName" can't be scheduled, please use "nodeSelector" instead
- Currently, only computing tasks are supported, and video codec processing is not supported.
- Support video codec processing
- Support Multi-Instance GPUs (MIG)
- TensorFlow 1.14.0/2.4.1
- torch 1.1.0
- mxnet 1.4.0
- mindspore 1.1.1
The above frameworks have passed the test.
- You can report a bug, a doubt or modify by filing a new issue
- If you want to know more or have ideas, you can participate in the Discussions and the slack exchanges
- Mengxuan Li (limengxuan@4paradigm.com)
- Zhaoyou Pei (peizhaoyou@4paradigm.com)
- Guangchuan Shi (shiguangchuan@4paradigm.com)
- Zhao Zheng (zhengzhao@4paradigm.com)
Owner & Maintainer: Limengxuan
Feel free to reach me by
email: <limengxuan@4paradigm.com>
phone: +86 18810644493
WeChat: xuanzong4493