hustcat/k8s-rdma-device-plugin

Failure to initialize the plugin

Opened this issue · 3 comments

Hi, @hustcat ,

I am currently working on a project involving Kubernetes and RDMA-enabled containers. I was very happy to find your RDMA device plugin project on GitHub because, if it worked, it would solve a lot of my problems, and I'm very grateful that you published it.

Unfortunately, when I tried to deploy the daemonset as described in your README, it produced the following error:

    2018-09-06T14:26:37.916726-04:00 tporch2.lab2-skae 500c35b3b7bd[1580]: time="2018-09-06T18:26:37Z" level=info msg="Fetching devices."
    2018-09-06T14:26:37.917634-04:00 tporch2.lab2-skae 500c35b3b7bd[1580]: time="2018-09-06T18:26:37Z" level=error msg="Error to get IB device: open /sys/class/net/flannel.1/device/resource: no such file or directory"

I traced it to the following piece of code in rdma.go:

`for _, d := range ibvDevList {
    for _, n := range netDevList {
        dResource, err := getRdmaDeviceResoure(d.Name)
        if err != nil {
            return nil, err
        }
        nResource, err := getNetDeviceResoure(n)
        if err != nil {
            return nil, err
        }
        // the same device
        if bytes.Compare(dResource, nResource) == 0 {
            devs = append(devs, Device{
                RdmaDevice: d,
                NetDevice:  n,
            })
        }
    }
}`

Several entries in /sys/class/net (the Docker virtual devices and flannel) have no device/resource file, which causes this error.

    # ls -al /sys/class/net/
    total 0
    drwxr-xr-x 2 root root 0 Sep 6 14:27 .
    drwxr-xr-x 74 root root 0 Sep 6 14:27 ..
    lrwxrwxrwx 1 root root 0 Sep 6 14:27 cni0 -> ../../devices/virtual/net/cni0
    lrwxrwxrwx 1 root root 0 Sep 6 14:27 docker0 -> ../../devices/virtual/net/docker0
    lrwxrwxrwx 1 root root 0 Sep 6 14:27 eth0 -> ../../devices/pci0000:00/0000:00:1c.0/0000:05:00.0/net/eth0
    lrwxrwxrwx 1 root root 0 Sep 6 14:27 eth1 -> ../../devices/pci0000:00/0000:00:1c.0/0000:05:00.1/net/eth1
    lrwxrwxrwx 1 root root 0 Sep 6 14:27 eth2 -> ../../devices/pci0000:00/0000:00:03.0/0000:03:00.0/net/eth2
    lrwxrwxrwx 1 root root 0 Sep 6 14:27 eth3 -> ../../devices/pci0000:00/0000:00:03.0/0000:03:00.0/net/eth3
    lrwxrwxrwx 1 root root 0 Sep 6 14:27 eth4 -> ../../devices/pci0000:00/0000:00:02.0/0000:02:00.0/net/eth4
    lrwxrwxrwx 1 root root 0 Sep 6 14:27 eth5 -> ../../devices/pci0000:00/0000:00:02.0/0000:02:00.1/net/eth5
    lrwxrwxrwx 1 root root 0 Sep 6 14:27 flannel.1 -> ../../devices/virtual/net/flannel.1
    lrwxrwxrwx 1 root root 0 Sep 6 14:27 lo -> ../../devices/virtual/net/lo
    lrwxrwxrwx 1 root root 0 Sep 6 14:27 veth22bb6ca5 -> ../../devices/virtual/net/veth22bb6ca5
    root@tporch2:~/projects/tporch# ls -al /sys/class/net/*/device
    lrwxrwxrwx 1 root root 0 Sep 5 14:30 /sys/class/net/eth0/device -> ../../../0000:05:00.0
    lrwxrwxrwx 1 root root 0 Sep 5 14:30 /sys/class/net/eth1/device -> ../../../0000:05:00.1
    lrwxrwxrwx 1 root root 0 Sep 5 14:31 /sys/class/net/eth2/device -> ../../../0000:03:00.0
    lrwxrwxrwx 1 root root 0 Sep 5 14:31 /sys/class/net/eth3/device -> ../../../0000:03:00.0
    lrwxrwxrwx 1 root root 0 Sep 5 14:31 /sys/class/net/eth4/device -> ../../../0000:02:00.0
    lrwxrwxrwx 1 root root 0 Sep 5 14:31 /sys/class/net/eth5/device -> ../../../0000:02:00.1

I understand that changing the code to check for the presence of the file before doing the comparison would fix the problem, but I wonder how you dealt with it when you tested your code. Did you not use Docker and flannel (or another CNI)? You surely must have had some virtual devices in your configuration, no? I would much appreciate an answer before I start hacking the code. :)

@fkogan I think you should pass -master your_ib_network_interface

Error to get IB device: open /sys/class/net/flannel.1/device/resource

Otherwise it may use flannel.1 by default.

@harryge00 - no, the problem is in GetAllNetDevice() in sriov.go. There's an attempt to exclude problematic interfaces already:

`func GetAllNetDevice() ([]string, error) {
    var res = []string{}
    ifaces, err := net.Interfaces()
    if err != nil {
        log.Errorf("localAddresses: %+v\n", err)
        return nil, err
    }
    log.Debugf("ifaces: %v", ifaces)
    for _, iface := range ifaces {
        if iface.Flags&(1<<uint(0)) == 0 {
            continue
        }
        if iface.Flags&(1<<uint(1)) == 0 {
            continue
        }
        if iface.Flags&(1<<uint(2)) != 0 {
            continue
        }
        if strings.HasPrefix(iface.Name, "docker") || strings.HasPrefix(iface.Name, "cali") {
            continue
        }
        res = append(res, iface.Name)
    }
    return res, nil
}`

I've just added this piece in rdma.go:

` const RdmaDeviceRource = "/sys/class/infiniband/%s/device/resource"
@@ -102,6 +101,12 @@ func getRdmaDeviceResoure(name string) ([]byte, error) {
 func getNetDeviceResoure(name string) ([]byte, error) {
     resourceFile := fmt.Sprintf(NetDeviceRource, name)
     data, err := ioutil.ReadFile(resourceFile)
+    if err != nil && os.IsNotExist(err) {
+        // not all Net devices have resource  file
+        // no such file - return empty data
+        data = nil
+        err = nil
+    }
     return data, err
 }`

The main issue with the code was that it supported only the v1alpha1 protocol and therefore did not work with k8s 1.11.2. I've fixed that and got it working for me, but the project looks abandoned, so I'm not sure it is worth submitting the changes...

a3y3 commented

@fkogan I'm trying to make the plugin work too. I recognise that it supports only v1alpha, but changing the pluginApi to v1beta1 isn't making any difference for me. Any ideas on what else I should change to make it work with k8s 1.11.2?