Mellanox/k8s-rdma-shared-dev-plugin

Migrating from sriov to shared device plugin

Closed this issue · 4 comments

Hi,

I am trying to use the RDMA shared device plugin in a cluster that previously used SR-IOV.

I have uninstalled the SR-IOV device plugin and its ConfigMap, then switched RDMA to shared network namespace mode. After that, the shared device plugin was able to run normally.

The RDMA shared device plugin log shows that the HCA devices are discovered, but kubelet does not seem to pick up this resource. Restarting kubelet and containerd on the hosts does not help, and the kubelet logs give no clues. Any recommendations on how to handle this?

Config map

apiVersion: v1
kind: ConfigMap
metadata:
  name: rdma-devices
  namespace: kube-system
data:
  config.json: |
    {
        "periodicUpdateInterval": 300,
        "configList": [{
                "resourceName": "hca_roce",
                "rdmaHcaMax": 5,
                "selectors": {
                  "vendors": ["15b3"],
                  "deviceIDs": ["1019"]
                }
            }
        ]
    } 
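
For reference, the resource name the node should end up advertising can be derived from this ConfigMap. The sketch below is only an illustration, not the plugin's own code. It assumes the default prefix is `rdma` (as the `ResourcePrefix:rdma` field in the plugin log suggests) and that an optional `resourcePrefix` key may override it; `expected_resources` is a hypothetical helper name.

```python
import json

# Assumption: "rdma" is the default prefix, as seen in the plugin log
# ("ResourcePrefix:rdma"). Adjust if your build reports something else.
DEFAULT_PREFIX = "rdma"

CONFIG = """
{
    "periodicUpdateInterval": 300,
    "configList": [{
        "resourceName": "hca_roce",
        "rdmaHcaMax": 5,
        "selectors": {
            "vendors": ["15b3"],
            "deviceIDs": ["1019"]
        }
    }]
}
"""


def expected_resources(config_text):
    """Map each '<prefix>/<resourceName>' to its rdmaHcaMax value."""
    config = json.loads(config_text)
    resources = {}
    for entry in config["configList"]:
        # "resourcePrefix" is treated as optional here (an assumption).
        prefix = entry.get("resourcePrefix") or DEFAULT_PREFIX
        resources[prefix + "/" + entry["resourceName"]] = entry["rdmaHcaMax"]
    return resources


print(expected_resources(CONFIG))
```

With the ConfigMap above, the node would be expected to list `rdma/hca_roce` with a count of 5 once registration succeeds.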

rdma shared device plugin logs

2024/05/07 11:27:25 Initializing resource servers
2024/05/07 11:27:25 Resource: &{ResourceName:hca_roce ResourcePrefix:rdma RdmaHcaMax:5 Devices:[] Selectors:{Vendors:[15b3] DeviceIDs:[1019] Drivers:[] IfNames:[] LinkTypes:[]}}
2024/05/07 11:27:25 Starting all servers...
2024/05/07 11:27:25 starting rdma/hca_roce device plugin endpoint at: hca_roce.sock
2024/05/07 11:27:25 rdma/hca_roce device plugin endpoint started serving
2024/05/07 11:27:25 All servers started.
2024/05/07 11:27:25 Listening for term signals
2024/05/07 11:27:25 Starting OS watcher.
2024/05/07 11:32:25 discovering host network devices
2024/05/07 11:32:25 DiscoverHostDevices(): device found: 0000:0c:00.0   02              Mellanox Technolo...    MT28800 Family [ConnectX-5 Ex]
2024/05/07 11:32:25 DiscoverHostDevices(): device found: 0000:0c:00.1   02              Mellanox Technolo...    MT28800 Family 
...
2024/05/07 11:32:25 DiscoverHostDevices(): device found: 0000:d1:00.0   02              Mellanox Technolo...    MT28800 Family [ConnectX-5 Ex]
2024/05/07 11:32:25 DiscoverHostDevices(): device found: 0000:d1:00.1   02              Mellanox Technolo...    MT28800 Family [ConnectX-5 Ex]
2024/05/07 11:32:25 no changes to devices for "rdma/hca_roce"
2024/05/07 11:32:25 exposing "5" devices

kubectl describe node does not show rdma/hca_roce

Capacity:
  cpu:                       128
...
  nvidia.com/gpu:            8
  nvidia.com/rdma_sriov_vf:  0
  pods:                      110
Allocatable:
..
  nvidia.com/gpu:            8
  nvidia.com/rdma_sriov_vf:  0
  pods:                      110
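
One way to confirm that the resource is really missing is to scan the Capacity section of the `kubectl describe node` output. This is a minimal sketch that assumes the output layout shown above; `capacity_resources` is a hypothetical helper, and the sample text is abridged from this thread.

```python
SAMPLE = """\
Capacity:
  cpu:                       128
  nvidia.com/gpu:            8
  nvidia.com/rdma_sriov_vf:  0
  pods:                      110
Allocatable:
  nvidia.com/gpu:            8
  nvidia.com/rdma_sriov_vf:  0
  pods:                      110
"""


def capacity_resources(describe_output):
    """Collect 'name: value' pairs listed under the Capacity: heading."""
    resources = {}
    in_capacity = False
    for line in describe_output.splitlines():
        stripped = line.strip()
        # A non-indented line ending in ':' starts a new section.
        if not line.startswith(" ") and stripped.endswith(":"):
            in_capacity = stripped == "Capacity:"
            continue
        if in_capacity and line.startswith(" ") and ":" in stripped:
            # Split on the last colon so names like nvidia.com/gpu survive.
            name, _, value = stripped.rpartition(":")
            resources[name.strip()] = value.strip()
    return resources


capacity = capacity_resources(SAMPLE)
print("rdma/hca_roce" in capacity)
```

For the output in this report the check prints `False`: the stale `nvidia.com/rdma_sriov_vf` entry is still listed, but `rdma/hca_roce` is not.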

This is resolved now. Our kubelet runs with an explicit root-dir path, while the plugin mounts the default kubelet path. This was causing device plugin registration with kubelet to fail.

As a workaround, we created a symlink from the default path (/var/lib/kubelet) to the root-dir. It works well.

Our situation is very similar to what you described. Could you share the configuration that ultimately worked for you?
We are using k8s version 1.25, with the following configuration:

        volumeMounts:
        - mountPath: /var/lib/kubelet/device-plugins
          name: device-plugin
        - mountPath: /var/lib/kubelet/plugins_registry
          name: plugins-registry
        - mountPath: /k8s-rdma-shared-dev-plugin
          name: config
        - mountPath: /dev/
          name: devs
      dnsPolicy: ClusterFirst
      hostNetwork: true
      nodeSelector:
        node-role.kubernetes.io/RDMA: "true"
      priorityClassName: system-node-critical
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      terminationGracePeriodSeconds: 30
      volumes:
      - hostPath:
          path: /var/lib/kubelet/device-plugins
          type: ""
        name: device-plugin
      - hostPath:
          path: /var/lib/kubelet/plugins_registry
          type: ""
        name: plugins-registry
      - configMap:
          defaultMode: 420
          items:
          - key: config.json
            path: config.json
          name: rdma-devices
        name: config
      - hostPath:
          path: /dev/
          type: ""
        name: devs

configmap:

{
        "periodicUpdateInterval": 300,
        "configList": [
           {
             "resourceName": "hpc_shared_devices_a",
             "rdmaHcaMax": 1000,
             "selectors": {
              "vendors": ["15b3"],
              "deviceIDs": ["1015"]
             }
            }
        ]
    }

rdma shared device plugin logs:

2024/05/28 03:09:59 Starting K8s RDMA Shared Device Plugin version= master
2024/05/28 03:09:59 resource manager reading configs
2024/05/28 03:09:59 Reading /k8s-rdma-shared-dev-plugin/config.json
Using Kubelet Plugin Registry Mode
2024/05/28 03:09:59 loaded config: [{ResourceName:hpc_shared_devices_a ResourcePrefix: RdmaHcaMax:1000 Devices:[] Selectors:{Vendors:[15b3] DeviceIDs:[1015] Drivers:[] IfNames:[] LinkTypes:[]}}]
2024/05/28 03:09:59 periodic update interval: +300
2024/05/28 03:09:59 Discovering host devices
2024/05/28 03:09:59 discovering host network devices
2024/05/28 03:09:59 DiscoverHostDevices(): device found: 0000:04:00.0	02          	Broadcom Inc. and...	NetXtreme BCM5720 Gigabit Ethernet PCIe
2024/05/28 03:09:59 DiscoverHostDevices(): device found: 0000:04:00.1	02          	Broadcom Inc. and...	NetXtreme BCM5720 Gigabit Ethernet PCIe
2024/05/28 03:09:59 DiscoverHostDevices(): device found: 0000:af:00.0	02          	Mellanox Technolo...	MT27710 Family [ConnectX-4 Lx]
2024/05/28 03:09:59 DiscoverHostDevices(): device found: 0000:af:00.1	02          	Mellanox Technolo...	MT27710 Family [ConnectX-4 Lx]
2024/05/28 03:09:59 Initializing resource servers
2024/05/28 03:09:59 Resource: &{ResourceName:hpc_shared_devices_a ResourcePrefix:rdma RdmaHcaMax:1000 Devices:[] Selectors:{Vendors:[15b3] DeviceIDs:[1015] Drivers:[] IfNames:[] LinkTypes:[]}}
2024/05/28 03:09:59 error creating new device: "missing RDMA device spec for device 0000:04:00.0, RDMA device \"issm\" not found"
2024/05/28 03:09:59 error creating new device: "missing RDMA device spec for device 0000:04:00.1, RDMA device \"issm\" not found"
2024/05/28 03:09:59 error creating new device: "missing RDMA device spec for device 0000:af:00.1, RDMA device \"issm\" not found"
2024/05/28 03:09:59 Starting all servers...
2024/05/28 03:09:59 starting rdma/hpc_shared_devices_a device plugin endpoint at: hpc_shared_devices_a.sock
2024/05/28 03:09:59 rdma/hpc_shared_devices_a device plugin endpoint started serving
2024/05/28 03:09:59 All servers started.
2024/05/28 03:09:59 Listening for term signals
2024/05/28 03:09:59 Starting OS watcher.
2024/05/28 03:14:59 discovering host network devices
2024/05/28 03:14:59 DiscoverHostDevices(): device found: 0000:04:00.0	02          	Broadcom Inc. and...	NetXtreme BCM5720 Gigabit Ethernet PCIe
2024/05/28 03:14:59 DiscoverHostDevices(): device found: 0000:04:00.1	02          	Broadcom Inc. and...	NetXtreme BCM5720 Gigabit Ethernet PCIe
2024/05/28 03:14:59 DiscoverHostDevices(): device found: 0000:af:00.0	02          	Mellanox Technolo...	MT27710 Family [ConnectX-4 Lx]
2024/05/28 03:14:59 DiscoverHostDevices(): device found: 0000:af:00.1	02          	Mellanox Technolo...	MT27710 Family [ConnectX-4 Lx]
2024/05/28 03:14:59 error creating new device: "missing RDMA device spec for device 0000:04:00.0, RDMA device \"issm\" not found"
2024/05/28 03:14:59 error creating new device: "missing RDMA device spec for device 0000:04:00.1, RDMA device \"issm\" not found"
2024/05/28 03:14:59 error creating new device: "missing RDMA device spec for device 0000:af:00.1, RDMA device \"issm\" not found"
2024/05/28 03:14:59 no changes to devices for "rdma/hpc_shared_devices_a"
2024/05/28 03:14:59 exposing "1000" devices
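
To see at a glance which discovered devices produced no "error creating new device" message, the log can be summarized with a short script. This is a sketch under assumptions: the regular expressions are fitted to the log lines above only, `summarize` is a hypothetical helper, and a device without an error line is not proven to have been accepted by the plugin.

```python
import re

# Abridged from the log above; tabs replaced with spaces.
SAMPLE_LOG = r"""
2024/05/28 03:09:59 DiscoverHostDevices(): device found: 0000:04:00.0 02 Broadcom Inc. and... NetXtreme BCM5720 Gigabit Ethernet PCIe
2024/05/28 03:09:59 DiscoverHostDevices(): device found: 0000:04:00.1 02 Broadcom Inc. and... NetXtreme BCM5720 Gigabit Ethernet PCIe
2024/05/28 03:09:59 DiscoverHostDevices(): device found: 0000:af:00.0 02 Mellanox Technolo... MT27710 Family [ConnectX-4 Lx]
2024/05/28 03:09:59 DiscoverHostDevices(): device found: 0000:af:00.1 02 Mellanox Technolo... MT27710 Family [ConnectX-4 Lx]
2024/05/28 03:09:59 error creating new device: "missing RDMA device spec for device 0000:04:00.0, RDMA device \"issm\" not found"
2024/05/28 03:09:59 error creating new device: "missing RDMA device spec for device 0000:04:00.1, RDMA device \"issm\" not found"
2024/05/28 03:09:59 error creating new device: "missing RDMA device spec for device 0000:af:00.1, RDMA device \"issm\" not found"
"""

FOUND = re.compile(r"device found: (\S+)")
FAILED = re.compile(r"error creating new device: .* for device ([0-9a-fA-F:.]+),")


def summarize(log_text):
    """Return (found, failed, without_error) lists of PCI addresses."""
    found, failed = set(), set()
    for line in log_text.splitlines():
        match = FOUND.search(line)
        if match:
            found.add(match.group(1))
            continue
        match = FAILED.search(line)
        if match:
            failed.add(match.group(1))
    return sorted(found), sorted(failed), sorted(found - failed)


found, failed, without_error = summarize(SAMPLE_LOG)
print("no error logged for:", without_error)
```

For this log, only 0000:af:00.0 has no error line.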

Hi,

Our kubelet service explicitly uses --root-dir=/var/lib/ssd/kubelet, so I had to create a symbolic link /var/lib/kubelet -> /var/lib/ssd/kubelet to allow the device plugin to work. I also had to make sure the existing data in /var/lib/kubelet stayed intact after creating the link. The default plugin YAML definition was left unchanged. Hope it helps!
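
The exact commands used were not shared, so the following is only a sketch of how such a link could be created without discarding data. `link_default_path` is a hypothetical helper; it refuses to touch a non-empty directory, leaving it to the operator to move any existing contents first. The demo runs in a temporary sandbox rather than on the real /var/lib paths, and on a real node you would presumably stop kubelet and run with root privileges (an assumption, not something stated in this thread).

```python
import os
import tempfile


def link_default_path(default_path, root_dir):
    """Make default_path a symlink to root_dir, never deleting data."""
    if os.path.islink(default_path):
        if os.path.realpath(default_path) == os.path.realpath(root_dir):
            return  # already points at the root-dir, nothing to do
        raise RuntimeError(default_path + " is a symlink to another location")
    if os.path.isdir(default_path):
        if os.listdir(default_path):
            # Existing data must be moved by hand before linking.
            raise RuntimeError(default_path + " is not empty; move its contents first")
        os.rmdir(default_path)
    os.symlink(root_dir, default_path)


# Demo in a sandbox that mimics /var/lib/ssd/kubelet and /var/lib/kubelet.
with tempfile.TemporaryDirectory() as sandbox:
    root_dir = os.path.join(sandbox, "ssd", "kubelet")
    default_path = os.path.join(sandbox, "kubelet")
    os.makedirs(os.path.join(root_dir, "device-plugins"))
    link_default_path(default_path, root_dir)
    # The device-plugins directory is now reachable via the default path.
    print(os.path.isdir(os.path.join(default_path, "device-plugins")))
```

With the link in place, the plugin's unchanged hostPath mounts under /var/lib/kubelet resolve to the directories kubelet actually uses.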

Thanks, I'll give it a try.