NVIDIA/k8s-device-plugin

Only one GPU resource is recognized on a node with 4 GPUs

invokerbyxv opened this issue · 0 comments

After installing k8s-device-plugin according to the documentation, the node reports only one nvidia.com/gpu in both Capacity and Allocatable:

Capacity:
  cpu:                80
  ephemeral-storage:  459329648Ki
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             131618692Ki
  nvidia.com/gpu:     1
  pods:               110
Allocatable:
  cpu:                80
  ephemeral-storage:  423318202896
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             131516292Ki
  nvidia.com/gpu:     1
  pods:               110
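
For anyone trying to reproduce or script this check, here is a minimal sketch in Python that pulls the nvidia.com/gpu count out of node description text like the output above. It is not part of the plugin; the function name gpu_counts and the trimmed SAMPLE string are made up for illustration, and it assumes the indented "name: value" layout shown in this issue.

```python
import re


def gpu_counts(describe_output: str, resource: str = "nvidia.com/gpu") -> dict:
    """Return {section: count} for `resource` in the Capacity and Allocatable sections.

    Hypothetical helper: assumes unindented section headers ("Capacity:")
    followed by indented "name: value" lines, as in the output in this issue.
    """
    counts = {}
    section = None
    for line in describe_output.splitlines():
        # Section headers are unindented and end with a colon, e.g. "Capacity:".
        header = re.match(r"^(\w[\w ]*):\s*$", line)
        if header:
            section = header.group(1)
            continue
        # Resource lines are indented, e.g. "  nvidia.com/gpu:     1".
        entry = re.match(r"^\s+(\S+):\s+(\S+)\s*$", line)
        if entry and section in ("Capacity", "Allocatable") and entry.group(1) == resource:
            counts[section] = int(entry.group(2))
    return counts


# Trimmed copy of the node output from this issue.
SAMPLE = """\
Capacity:
  cpu:                80
  nvidia.com/gpu:     1
  pods:               110
Allocatable:
  cpu:                80
  nvidia.com/gpu:     1
  pods:               110
"""

print(gpu_counts(SAMPLE))  # prints {'Capacity': 1, 'Allocatable': 1}
```

On this node the expected value in both sections would be 4, so a result of 1 reproduces the problem described here.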

Logs of nvidia-device-plugin:

I1016 08:26:57.904249       1 main.go:199] Starting FS watcher.
I1016 08:26:57.904393       1 main.go:206] Starting OS watcher.
I1016 08:26:57.904773       1 main.go:221] Starting Plugins.
I1016 08:26:57.904900       1 main.go:278] Loading configuration.
I1016 08:26:57.907863       1 main.go:303] Updating config with default resource matching patterns.
I1016 08:26:57.908757       1 main.go:314] 
Running with config:
{
  "version": "v1",
  "flags": {
    "migStrategy": "none",
    "failOnInitError": false,
    "mpsRoot": "",
    "nvidiaDriverRoot": "/",
    "nvidiaDevRoot": "/",
    "gdsEnabled": false,
    "mofedEnabled": false,
    "useNodeFeatureAPI": null,
    "deviceDiscoveryStrategy": "tegra",
    "plugin": {
      "passDeviceSpecs": false,
      "deviceListStrategy": [
        "envvar"
      ],
      "deviceIDStrategy": "uuid",
      "cdiAnnotationPrefix": "cdi.k8s.io/",
      "nvidiaCTKPath": "/usr/bin/nvidia-ctk",
      "containerDriverRoot": "/driver-root"
    }
  },
  "resources": {
    "gpus": [
      {
        "pattern": "*",
        "name": "nvidia.com/gpu"
      }
    ]
  },
  "sharing": {
    "timeSlicing": {}
  }
}
I1016 08:26:57.908797       1 main.go:317] Retrieving plugins.
I1016 08:26:57.909561       1 server.go:216] Starting GRPC server for 'nvidia.com/gpu'
I1016 08:26:57.911905       1 server.go:147] Starting to serve 'nvidia.com/gpu' on /var/lib/kubelet/device-plugins/nvidia-gpu.sock
I1016 08:26:57.917178       1 server.go:154] Registered device plugin for 'nvidia.com/gpu' with Kubelet

But the node actually has 4 GPUs. Output of docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi:

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.107.02             Driver Version: 550.107.02     CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 4090 D      Off |   00000000:18:00.0 Off |                  Off |
| 30%   31C    P8             10W /  425W |      12MiB /  24564MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA GeForce RTX 4090 D      Off |   00000000:5E:00.0 Off |                  Off |
| 30%   32C    P8             19W /  425W |      12MiB /  24564MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA GeForce RTX 4090 D      Off |   00000000:86:00.0 Off |                  Off |
| 30%   31C    P8             21W /  425W |      12MiB /  24564MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   3  NVIDIA GeForce RTX 4090 D      Off |   00000000:AF:00.0 Off |                  Off |
| 30%   33C    P8             15W /  425W |      12MiB /  24564MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
+-----------------------------------------------------------------------------------------+

Plugin version: v0.16.1