GoogleCloudPlatform/container-engine-accelerators

Adding NUMA/TopologyManager support to the GPU device plugin

robertdavidsmith opened this issue · 1 comment

Hi,

I’m interested in adding TopologyManager/NUMA support to your GPU device plugin.

I believe this is a case of:

  1. Upgrade container-engine-accelerators/vendor/k8s.io/kubernetes/pkg/kubelet/apis/deviceplugin/v1beta1/api.proto to a newer version that includes TopologyInfo
  2. Find a way of mapping paths under /dev to paths under /sys (for example, map /dev/nvidia0 to /sys/devices/pci0000:00/0000:00:05.0)
  3. Read a file such as /sys/devices/pci0000:00/0000:00:05.0/numa_node and return the result over protobuf (sketched below)
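
To make steps 1 and 3 concrete, here's a minimal sketch of the sysfs read, assuming the upgraded proto from step 1 exposes `TopologyInfo`/`NUMANode` the same way upstream Kubernetes v1beta1 does (`topologyForPCIDevice` is just an illustrative name, not existing plugin code):

```go
package topology

import (
	"os"
	"path/filepath"
	"strconv"
	"strings"

	pluginapi "k8s.io/kubernetes/pkg/kubelet/apis/deviceplugin/v1beta1"
)

// topologyForPCIDevice reads <sysfsPath>/numa_node (step 3) and wraps the
// result in the TopologyInfo message that step 1 would add to the vendored
// proto. The kernel reports -1 when a device has no NUMA affinity; in that
// case we return nil so the plugin sends no topology hint.
func topologyForPCIDevice(sysfsPath string) (*pluginapi.TopologyInfo, error) {
	raw, err := os.ReadFile(filepath.Join(sysfsPath, "numa_node"))
	if err != nil {
		return nil, err
	}
	node, err := strconv.ParseInt(strings.TrimSpace(string(raw)), 10, 64)
	if err != nil {
		return nil, err
	}
	if node < 0 {
		return nil, nil
	}
	return &pluginapi.TopologyInfo{
		Nodes: []*pluginapi.NUMANode{{ID: node}},
	}, nil
}
```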

Steps 1 and 3 should be easy enough. Step 2 is harder because of the need to get the PCI ID for a device.

For the device->PCI ID mapping, the options I’m aware of are:

  1. It would be great if we could do the mapping just by looking under /sys. This is easy for disks (just look under /sys/block) but doesn’t appear possible for GPUs (I’d love to be proven wrong here).
  2. Make use of NVML’s nvmlDeviceGetPciInfo function. Making the device plugin use NVML has already been attempted at https://github.com/GoogleCloudPlatform/container-engine-accelerators/pull/52/files, but this PR was never merged. If we could get this PR merged, adding a call to nvmlDeviceGetPciInfo would be trivial.
  3. Run nvidia-smi and parse its output (sketched below).
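
To illustrate option 3, here is a rough sketch that shells out to nvidia-smi and builds the device→sysfs mapping. `gpuSysfsPaths` is a hypothetical helper; it assumes nvidia-smi is on PATH and uses its standard `--query-gpu`/`--format` flags:

```go
package topology

import (
	"os/exec"
	"strings"
)

// gpuSysfsPaths maps each GPU index reported by nvidia-smi to the device's
// sysfs directory, e.g. "0" -> "/sys/bus/pci/devices/0000:00:05.0" (a
// symlink into /sys/devices/...). Sketch only; error handling is minimal.
func gpuSysfsPaths() (map[string]string, error) {
	out, err := exec.Command("nvidia-smi",
		"--query-gpu=index,pci.bus_id", "--format=csv,noheader").Output()
	if err != nil {
		return nil, err
	}
	paths := make(map[string]string)
	for _, line := range strings.Split(strings.TrimSpace(string(out)), "\n") {
		fields := strings.SplitN(line, ",", 2)
		if len(fields) != 2 {
			continue
		}
		index := strings.TrimSpace(fields[0])
		// nvidia-smi prints an 8-digit uppercase PCI domain
		// ("00000000:00:05.0"); sysfs uses a 4-digit lowercase one
		// ("0000:00:05.0"), so normalise before building the path.
		busID := strings.ToLower(strings.TrimSpace(fields[1]))
		if len(busID) == len("00000000:00:05.0") {
			busID = busID[4:]
		}
		paths[index] = "/sys/bus/pci/devices/" + busID
	}
	return paths, nil
}
```

Combined with the numa_node read sketched above, this would cover steps 2 and 3 without pulling in NVML, though option 2 is probably more robust than parsing CLI output (and this assumes the nvidia-smi index lines up with the /dev/nvidiaN minor number, which would need verifying).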

What are your thoughts? It would be great to agree on a design before I start work on a new PR.

Kind regards,

Rob