Only one GPU resource is recognized on a node with 4 GPUs
invokerbyxv opened this issue · 0 comments
After installing k8s-device-plugin according to the documentation, the node reports only one nvidia.com/gpu:
Capacity:
  cpu: 80
  ephemeral-storage: 459329648Ki
  hugepages-1Gi: 0
  hugepages-2Mi: 0
  memory: 131618692Ki
  nvidia.com/gpu: 1
  pods: 110
Allocatable:
  cpu: 80
  ephemeral-storage: 423318202896
  hugepages-1Gi: 0
  hugepages-2Mi: 0
  memory: 131516292Ki
  nvidia.com/gpu: 1
  pods: 110
Logs of nvidia-device-plugin:
I1016 08:26:57.904249 1 main.go:199] Starting FS watcher.
I1016 08:26:57.904393 1 main.go:206] Starting OS watcher.
I1016 08:26:57.904773 1 main.go:221] Starting Plugins.
I1016 08:26:57.904900 1 main.go:278] Loading configuration.
I1016 08:26:57.907863 1 main.go:303] Updating config with default resource matching patterns.
I1016 08:26:57.908757 1 main.go:314]
Running with config:
{
  "version": "v1",
  "flags": {
    "migStrategy": "none",
    "failOnInitError": false,
    "mpsRoot": "",
    "nvidiaDriverRoot": "/",
    "nvidiaDevRoot": "/",
    "gdsEnabled": false,
    "mofedEnabled": false,
    "useNodeFeatureAPI": null,
    "deviceDiscoveryStrategy": "tegra",
    "plugin": {
      "passDeviceSpecs": false,
      "deviceListStrategy": [
        "envvar"
      ],
      "deviceIDStrategy": "uuid",
      "cdiAnnotationPrefix": "cdi.k8s.io/",
      "nvidiaCTKPath": "/usr/bin/nvidia-ctk",
      "containerDriverRoot": "/driver-root"
    }
  },
  "resources": {
    "gpus": [
      {
        "pattern": "*",
        "name": "nvidia.com/gpu"
      }
    ]
  },
  "sharing": {
    "timeSlicing": {}
  }
}
I1016 08:26:57.908797 1 main.go:317] Retrieving plugins.
I1016 08:26:57.909561 1 server.go:216] Starting GRPC server for 'nvidia.com/gpu'
I1016 08:26:57.911905 1 server.go:147] Starting to serve 'nvidia.com/gpu' on /var/lib/kubelet/device-plugins/nvidia-gpu.sock
I1016 08:26:57.917178 1 server.go:154] Registered device plugin for 'nvidia.com/gpu' with Kubelet
But the node actually has 4 GPUs. Output of docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi:
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.107.02             Driver Version: 550.107.02     CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 4090 D      Off |   00000000:18:00.0 Off |                  Off |
| 30%   31C    P8             10W /  425W |      12MiB /  24564MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA GeForce RTX 4090 D      Off |   00000000:5E:00.0 Off |                  Off |
| 30%   32C    P8             19W /  425W |      12MiB /  24564MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA GeForce RTX 4090 D      Off |   00000000:86:00.0 Off |                  Off |
| 30%   31C    P8             21W /  425W |      12MiB /  24564MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   3  NVIDIA GeForce RTX 4090 D      Off |   00000000:AF:00.0 Off |                  Off |
| 30%   33C    P8             15W /  425W |      12MiB /  24564MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
+-----------------------------------------------------------------------------------------+
Plugin version: v0.16.1
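
For anyone trying to reproduce the mismatch, here is a minimal Python sketch that compares the GPU count the node advertises with the count nvidia-smi sees. It assumes you have saved the node description text and the nvidia-smi output as strings. The function names advertised_gpus and visible_gpus are hypothetical helpers, not part of k8s-device-plugin or kubectl, and the bus-ID pattern is an assumption based on the output format shown above.

```python
import re

# PCI bus IDs as printed in the nvidia-smi table above, e.g. 00000000:18:00.0
# (assumed format: 8 hex digits, 2 hex digits, 2 hex digits, 1 hex digit).
BUS_ID = re.compile(
    r"\b[0-9A-Fa-f]{8}:[0-9A-Fa-f]{2}:[0-9A-Fa-f]{2}\.[0-9A-Fa-f]\b"
)


def advertised_gpus(describe_text, resource="nvidia.com/gpu"):
    """Return the resource count found under Capacity and Allocatable.

    Hypothetical helper: parses node description text laid out like the
    Capacity/Allocatable listing in this report.
    """
    counts = {}
    section = None
    for raw in describe_text.splitlines():
        line = raw.strip()
        if line in ("Capacity:", "Allocatable:"):
            section = line[:-1]
            continue
        if section and line.startswith(resource + ":"):
            counts[section] = int(line.split(":", 1)[1].strip())
    return counts


def visible_gpus(smi_text):
    """Count distinct PCI bus IDs in nvidia-smi table output."""
    return len(set(BUS_ID.findall(smi_text)))


# Sample values copied from this report.
describe_sample = """Capacity:
  nvidia.com/gpu: 1
Allocatable:
  nvidia.com/gpu: 1
"""
smi_sample = """| 0 NVIDIA GeForce RTX 4090 D Off | 00000000:18:00.0 Off | Off |
| 1 NVIDIA GeForce RTX 4090 D Off | 00000000:5E:00.0 Off | Off |
| 2 NVIDIA GeForce RTX 4090 D Off | 00000000:86:00.0 Off | Off |
| 3 NVIDIA GeForce RTX 4090 D Off | 00000000:AF:00.0 Off | Off |
"""

advertised = advertised_gpus(describe_sample)
visible = visible_gpus(smi_sample)
if advertised.get("Capacity") != visible:
    print(f"mismatch: node advertises {advertised}, nvidia-smi shows {visible}")
```

The sketch only detects the discrepancy described in this issue; it does not identify the cause.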