naive question about devices
Closed this issue · 2 comments
Hi! I'm new to setting up InfiniBand, and I'm working on getting a setup running in Azure. I have it working on a set of nodes in AKS, and I'm now working on binding the device to a pod. My naive question is about the output (on the node), where I see two devices, mlx5_0 and mlx5_1. I notice that one is InfiniBand and one is Ethernet:
Is that right? And if so, what distinguishes the two? When I use the daemonset here to bind to a pod, everything goes through OK, but only the InfiniBand device (the top one of the two) makes it through. I can see the device everywhere except when I run ip link (it does not show up as ib0 as it does on the host). Should I expect it to, and if so, where can I look to debug this? I tried installing the driver on the pod, and that was unable to restart
but I was able to start
and it didn't make a difference. Next I built libfabric with UCX, but I got a bunch of bus errors. I suspect there is an issue with how the pod sees the device, and wanted to ask for your wisdom.
Also note that the screenshot is from Usernetes (user-space Kubernetes), an environment from yesterday (a different setup) where I'm facing similar issues. That one is slightly more challenging because of the user-space parts, and I didn't make it as far. I'm also learning a lot and enjoying it, so thank you!
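One way to compare what the host and the pod see, instead of eyeballing ip link, is to read the interface types from sysfs. The sketch below is only an illustration: the function name is made up, and it assumes the usual Linux layout where /sys/class/net/<iface>/type holds the ARPHRD hardware type (1 for Ethernet, 32 for InfiniBand). Running it on the node and then inside the pod should show whether an IPoIB interface such as ib0 exists in each network namespace.

```python
from pathlib import Path

# ARPHRD hardware type codes from the Linux kernel headers (if_arp.h).
ARPHRD_ETHER = 1
ARPHRD_INFINIBAND = 32


def net_interfaces_by_type(sysfs_root="/sys/class/net"):
    """Return {interface name: 'ethernet' | 'infiniband' | 'other(N)'}.

    Hypothetical helper for illustration; sysfs_root is a parameter so the
    function can be pointed at a fake tree for testing.
    """
    names = {ARPHRD_ETHER: "ethernet", ARPHRD_INFINIBAND: "infiniband"}
    found = {}
    root = Path(sysfs_root)
    if not root.is_dir():
        return found
    for iface in sorted(root.iterdir()):
        type_file = iface / "type"
        if not type_file.is_file():
            # Skip entries that are not interfaces (e.g. bonding_masters).
            continue
        try:
            code = int(type_file.read_text().strip())
        except ValueError:
            continue
        found[iface.name] = names.get(code, f"other({code})")
    return found


if __name__ == "__main__":
    for name, kind in net_interfaces_by_type().items():
        print(f"{name}: {kind}")
```

If ib0 is listed on the node but not in the pod, the network interface simply is not present in the pod's network namespace, which is a separate question from whether the RDMA device itself is visible there.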
It depends on what hardware you have in the AKS instance. It's entirely possible that one NIC is InfiniBand and the other is Ethernet.
k8s-rdma-shared-device-plugin supports devices with both link types.
The devices that the RDMA shared device plugin exposes depend on its configuration (usually provided via a ConfigMap).
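To check which link type each RDMA device reports, a minimal sketch along these lines may help. It assumes the standard sysfs layout in which /sys/class/infiniband/<device>/ports/<port>/link_layer contains either InfiniBand or Ethernet; the function name is made up for illustration and is not part of the plugin.

```python
from pathlib import Path


def rdma_link_layers(sysfs_root="/sys/class/infiniband"):
    """Return {'<device>/<port>': link layer string} for each RDMA port.

    Hypothetical helper for illustration; sysfs_root is a parameter so the
    function can be pointed at a fake tree for testing.
    """
    result = {}
    root = Path(sysfs_root)
    if not root.is_dir():
        # No RDMA devices visible (or not a Linux host).
        return result
    for dev in sorted(root.iterdir()):
        ports = dev / "ports"
        if not ports.is_dir():
            continue
        for port in sorted(ports.iterdir()):
            link_layer = port / "link_layer"
            if link_layer.is_file():
                result[f"{dev.name}/{port.name}"] = link_layer.read_text().strip()
    return result


if __name__ == "__main__":
    for port, layer in rdma_link_layers().items():
        print(f"{port}: {layer}")
```

Running this on the node and again inside the pod gives a quick view of which devices, and which link types, are visible in each place.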
Thanks! I think this would have been a helpful discussion 2 months ago, but we wound up creating a custom installer: https://github.com/converged-computing/aks-infiniband-install. The code here was really helpful for learning and understanding, so thank you!