Adding NUMA/TopologyManager support to gpu device plugin
robertdavidsmith opened this issue · 1 comment
robertdavidsmith commented
Hi,
I’m interested in adding TopologyManager/NUMA support to your GPU device plugin.
I believe this is a case of:
- Upgrade container-engine-accelerators/vendor/k8s.io/kubernetes/pkg/kubelet/apis/deviceplugin/v1beta1/api.proto to a newer version that includes TopologyInfo
- Find a way of mapping paths under /dev to paths under /sys (for example map /dev/nvidia0 to /sys/devices/pci0000:00/0000:00:05.0)
- Read file such as /sys/devices/pci0000:00/0000:00:05.0/numa_node and return over protobuf
Steps 1 and 3 should be easy enough. Step 2 is harder because of the need to get the PCI ID for a device.
For the device→PCI ID mapping, the options I’m aware of are:
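To make step 3 concrete, here is a minimal sketch of what reading the NUMA node could look like. The function names are hypothetical, but the sysfs layout is standard: `/sys/bus/pci/devices/<addr>/numa_node` contains the node ID, with `-1` meaning the kernel has no NUMA affinity for that device.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strconv"
	"strings"
)

// parseNUMANode converts the contents of a sysfs numa_node file
// (e.g. "0\n") into a node ID. The kernel writes -1 when the
// device has no NUMA affinity.
func parseNUMANode(contents string) (int, error) {
	return strconv.Atoi(strings.TrimSpace(contents))
}

// numaNodeForPCIDevice reads the numa_node file for a PCI address
// such as "0000:00:05.0". (Hypothetical helper; error handling
// for missing devices would need more care in the real plugin.)
func numaNodeForPCIDevice(pciAddr string) (int, error) {
	raw, err := os.ReadFile(filepath.Join("/sys/bus/pci/devices", pciAddr, "numa_node"))
	if err != nil {
		return -1, err
	}
	return parseNUMANode(string(raw))
}

func main() {
	node, err := parseNUMANode("0\n")
	fmt.Println(node, err)
}
```

The resulting node ID would then be placed into the `TopologyInfo` field of the device plugin's `ListAndWatch` response once the proto is upgraded.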
- It would be great if we could do the mapping just by looking under /sys. This is easy for disks (just look under /sys/block) but doesn’t appear possible for GPUs (I’d love to be proven wrong here).
- Make use of NVML’s nvmlDeviceGetPciInfo function. Making the device plugin use NVML has already been attempted at https://github.com/GoogleCloudPlatform/container-engine-accelerators/pull/52/files, but this PR was never merged. If we could get this PR merged, adding a call to nvmlDeviceGetPciInfo would be trivial.
- Run nvidia-smi then parse the output.
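For the last option, the parsing side is straightforward. A sketch, assuming `nvidia-smi --query-gpu=index,pci.bus_id --format=csv,noheader` is available on the node (output lines look like `0, 00000000:00:05.0`):

```go
package main

import (
	"fmt"
	"strings"
)

// parsePCIBusIDs parses the CSV output of
//   nvidia-smi --query-gpu=index,pci.bus_id --format=csv,noheader
// into a map from GPU index to PCI bus ID.
func parsePCIBusIDs(output string) map[string]string {
	result := make(map[string]string)
	for _, line := range strings.Split(strings.TrimSpace(output), "\n") {
		parts := strings.SplitN(line, ",", 2)
		if len(parts) != 2 {
			continue
		}
		result[strings.TrimSpace(parts[0])] = strings.TrimSpace(parts[1])
	}
	return result
}

func main() {
	sample := "0, 00000000:00:05.0\n1, 00000000:00:06.0\n"
	fmt.Println(parsePCIBusIDs(sample))
}
```

The downside is a runtime dependency on the `nvidia-smi` binary and its output format, which is why the NVML route feels cleaner if PR #52 can be revived.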
What are your thoughts? It would be great to agree on a design before I start work on a new PR.
Kind regards,
Rob
robertdavidsmith commented
PR here #165