Mellanox/k8s-rdma-shared-dev-plugin

Cannot use ib_write_bw

Opened this issue · 2 comments

My problem is the same as #72. When I run ib_write_bw in a pod, I get the following error:

[root@mofed-test-cx5-bond-pod2 /]# ib_write_bw -d mlx5_0  -F --report_gbits 

************************************
* Waiting for client to connect... *
************************************

[root@mofed-test-cx5-bond-pod1 /]# ib_write_bw -d mlx5_0  -F --report_gbits 10.56.217.73
---------------------------------------------------------------------------------------
                    RDMA_Write BW Test
 Dual-port       : OFF		Device         : mlx5_0
 Number of qps   : 1		Transport type : IB
 Connection type : RC		Using SRQ      : OFF
 TX depth        : 128
 CQ Moderation   : 100
 Mtu             : 1024[B]
 Link type       : Ethernet
 GID index       : 0
 Max inline data : 0[B]
 rdma_cm QPs	 : OFF
 Data ex. method : Ethernet
---------------------------------------------------------------------------------------
 local address: LID 0000 QPN 0x00da PSN 0x416f84 RKey 0x1804c1 VAddr 0x007f8de93fe000
 GID: 00:00:00:00:00:00:00:00:00:00:00:00:00:00:00:00
 remote address: LID 0000 QPN 0x0132 PSN 0xbaf42b RKey 0x181ddc VAddr 0x007fb24d4d1000
 GID: 00:00:00:00:00:00:00:00:00:00:00:00:00:00:00:00
Failed to modify QP 218 to RTR
 Unable to Connect the HCA's through the link
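
Both endpoints in the output above report an all-zero GID at GID index 0, which likely means the GID table entry being used is empty. On RoCE (link type Ethernet) that would be consistent with the QP failing to move to RTR. A minimal sketch for spotting this in saved perftest output; the function name `find_zero_gids` and the sample text are illustrative and not part of perftest:

```python
import re
from typing import List

# Matches perftest lines of the form " GID: 00:00:...:00".
GID_LINE = re.compile(r"^\s*GID:\s*([0-9A-Fa-f:]+)\s*$")


def find_zero_gids(perftest_output: str) -> List[int]:
    """Return 1-based line numbers of GID lines whose octets are all zero."""
    zero_lines = []
    for lineno, line in enumerate(perftest_output.splitlines(), start=1):
        match = GID_LINE.match(line)
        if match is None:
            continue
        octets = match.group(1).split(":")
        # An octet counts as zero if it consists only of '0' characters,
        # so the check does not depend on the numeric base used for printing.
        if all(octet.strip("0") == "" for octet in octets):
            zero_lines.append(lineno)
    return zero_lines


sample = """\
 local address: LID 0000 QPN 0x00da PSN 0x416f84 RKey 0x1804c1 VAddr 0x007f8de93fe000
 GID: 00:00:00:00:00:00:00:00:00:00:00:00:00:00:00:00
 remote address: LID 0000 QPN 0x0132 PSN 0xbaf42b RKey 0x181ddc VAddr 0x007fb24d4d1000
 GID: 00:00:00:00:00:00:00:00:00:00:00:00:00:00:00:00
"""
print(find_zero_gids(sample))  # prints [2, 4]
```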

pod1's network card information

[root@mofed-test-cx5-bond-pod1 /]# ifconfig
eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1450
        inet 172.16.8.42  netmask 255.255.255.0  broadcast 172.16.8.255
        inet6 fe80::c80e:9ff:fe94:143f  prefixlen 64  scopeid 0x20<link>
        ether ca:0e:09:94:14:3f  txqueuelen 0  (Ethernet)
        RX packets 52  bytes 5232 (5.1 KiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 15  bytes 1102 (1.0 KiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

lo: flags=73<UP,LOOPBACK,RUNNING>  mtu 65536
        inet 127.0.0.1  netmask 255.0.0.0
        inet6 ::1  prefixlen 128  scopeid 0x10<host>
        loop  txqueuelen 1000  (Local Loopback)
        RX packets 2  bytes 100 (100.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 2  bytes 100 (100.0 B)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

net1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 10.56.217.72  netmask 255.255.255.0  broadcast 10.56.217.255
        inet6 fe80::7c8f:a5ff:fe49:e86d  prefixlen 64  scopeid 0x20<link>
        ether 7e:8f:a5:49:e8:6d  txqueuelen 0  (Ethernet)
        RX packets 998  bytes 62494 (61.0 KiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 149  bytes 11268 (11.0 KiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

pod2's network card information

[root@mofed-test-cx5-bond-pod2 /]# ifconfig
eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1450
        inet 172.16.4.73  netmask 255.255.255.0  broadcast 172.16.4.255
        inet6 fe80::8c2f:74ff:fe8c:1418  prefixlen 64  scopeid 0x20<link>
        ether 8e:2f:74:8c:14:18  txqueuelen 0  (Ethernet)
        RX packets 23  bytes 1786 (1.7 KiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 14  bytes 1032 (1.0 KiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

lo: flags=73<UP,LOOPBACK,RUNNING>  mtu 65536
        inet 127.0.0.1  netmask 255.0.0.0
        inet6 ::1  prefixlen 128  scopeid 0x10<host>
        loop  txqueuelen 1000  (Local Loopback)
        RX packets 0  bytes 0 (0.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 0  bytes 0 (0.0 B)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

net1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 10.56.217.73  netmask 255.255.255.0  broadcast 10.56.217.255
        inet6 fe80::ec2b:e7ff:fe1d:2f43  prefixlen 64  scopeid 0x20<link>
        ether ee:2b:e7:1d:2f:43  txqueuelen 0  (Ethernet)
        RX packets 742  bytes 47024 (45.9 KiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 104  bytes 8294 (8.0 KiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

Pod1 and pod2 can ping each other over the net1 interface.

When I set the master to the RDMA NIC on one of these two sets of servers, it works fine. Can anyone explain why?

apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  creationTimestamp: "2024-06-04T06:20:31Z"
  generation: 4
  name: macvlan-cx5-bond-conf
  namespace: default
  resourceVersion: "32903014"
  uid: 062f5ed1-0e94-4ff0-aef0-bde7a2eb2053
spec:
  config: '{ "cniVersion": "0.3.1", "type": "macvlan","master": "ens61f0np0","ipam":
    { "type": "host-local", "subnet": "10.56.217.0/24", "rangeStart": "10.56.217.71",
    "rangeEnd": "10.56.217.81", "routes": [ { "dst": "0.0.0.0/0" } ], "gateway": "10.56.217.1"
    } }'

The macvlan master interface needs to be the netdevice of the RDMA-capable NIC. IIRC, that netdevice is used to generate the GID.
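
One way to check this from inside the pod is to list the populated GID table entries and the netdevice each one is tied to. The sketch below reads the standard sysfs layout under `/sys/class/infiniband/<device>/ports/<port>/` (`gids/` and `gid_attrs/ndevs/`). The helper name `list_populated_gids` is hypothetical, and the device `mlx5_0` and port `1` are assumptions taken from the command line in this issue:

```python
from pathlib import Path
from typing import List, Tuple

# sysfs prints an empty GID table entry as all zeros.
ZERO_GID = "0000:0000:0000:0000:0000:0000:0000:0000"


def list_populated_gids(
    device: str, port: int = 1, sysfs_root: str = "/sys/class/infiniband"
) -> List[Tuple[int, str, str]]:
    """Return (index, gid, netdev) for every non-empty GID table entry."""
    port_dir = Path(sysfs_root) / device / "ports" / str(port)
    gids_dir = port_dir / "gids"
    if not gids_dir.is_dir():
        return []
    entries = []
    gid_files = [p for p in gids_dir.iterdir() if p.name.isdigit()]
    for gid_file in sorted(gid_files, key=lambda p: int(p.name)):
        try:
            gid = gid_file.read_text().strip()
        except OSError:
            continue  # unreadable entry, treat as empty
        if not gid or gid == ZERO_GID:
            continue
        ndev_file = port_dir / "gid_attrs" / "ndevs" / gid_file.name
        try:
            ndev = ndev_file.read_text().strip()
        except OSError:
            ndev = "?"  # no netdevice associated with this entry
        entries.append((int(gid_file.name), gid, ndev))
    return entries


# Assumed device name from this issue; prints nothing if no GID is populated.
for index, gid, ndev in list_populated_gids("mlx5_0"):
    print(f"index {index}: {gid} (netdev {ndev})")
```

If this prints nothing in the pod, no GID is available to the RDMA device there, which would match the all-zero GIDs reported by ib_write_bw. When entries do appear, the index of the desired one can be passed to ib_write_bw with `-x`.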