openshift/sriov-network-operator

Cannot config Mellanox ConnectX-4 Lx on CoreOS

AyaSenri opened this issue · 9 comments

Description of problem:
When a SriovNetworkNodePolicy is created for a ConnectX-4 Lx, the VFs cannot be initialized.
It is similar to #43.

Steps to Reproduce:

  1. Install the SR-IOV operator with 'make deploy-setup'.
  2. Label node master3 with sriov3=true (oc label), then create a SriovNetworkNodePolicy for the ConnectX-4 Lx, like this:
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: policy-1
  namespace: openshift-sriov-network-operator
spec:
  deviceType: vfio-pci
  mtu: 1500
  nicSelector:
    rootDevices:
    - 0000:3b:00.1
  nodeSelector:
    sriov3: "true"
  numVfs: 2
  priority: 90
  resourceName: mlxnics
  3. Check the node states and the sriov config daemon logs.
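
The steps above can be sketched as shell commands. This is a minimal sketch, not the reporter's exact procedure: the file name policy-1.yaml and the function name `reproduce` are assumptions for illustration, and the label value "true" is inferred from the policy's nodeSelector.

```shell
#!/usr/bin/env bash
# Sketch of the reproduction steps; assumes `oc` is logged in to the cluster.
# NAMESPACE is taken from the policy above; POLICY_FILE is a hypothetical file name.
NAMESPACE="openshift-sriov-network-operator"
POLICY_FILE="${POLICY_FILE:-policy-1.yaml}"

reproduce() {
  local node="$1"
  # Label the node so the policy's nodeSelector (sriov3: "true") matches it.
  oc label node "$node" sriov3=true --overwrite || return 1
  # Create the SriovNetworkNodePolicy from the manifest.
  oc apply -f "$POLICY_FILE" || return 1
  # Inspect the per-node state reported by the operator.
  oc get sriovnetworknodestates -n "$NAMESPACE" -o yaml
}

# Usage: reproduce master3
```

The function stops at the first failing command, so a failed label or apply is not masked by the final query.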

Actual results:
oc logs sriov-network-config-daemon-6nhh7 -n openshift-sriov-network-operator -c sriov-network-config-daemon

I0907 09:51:27.061528 2912728 mellanox_plugin.go:227] mellanox-plugin getMlnxNicFwData(): device 0000:3b:00.1
I0907 09:51:27.061531 2912728 mellanox_plugin.go:220] mellanox-plugin mstConfigReadData(): device 0000:3b:00.1
I0907 09:51:27.061533 2912728 utils.go:740] RunCommand(): mstconfig [-e -d 0000:3b:00.1 q]
I0907 09:51:27.061555 2912728 utils.go:747] RunCommand(): mstconfig, [-e -d 0000:3b:00.1 q]
I0907 09:51:27.066132 2912728 utils.go:749] RunCommand(): -E- Failed to open the device
, exit status 3
E0907 09:51:27.066157 2912728 mellanox_plugin.go:233] mellanox-plugin getMlnxNicFwData(): failed exit status 3
E0907 09:51:27.066164 2912728 daemon.go:462] nodeStateSyncHandler(): plugin mellanox_plugin error: exit status 3
[core@master3 ~]$ lspci -nn | grep Mellanox
3b:00.0 Ethernet controller [0200]: Mellanox Technologies MT27710 Family [ConnectX-4 Lx] [15b3:1015]
3b:00.1 Ethernet controller [0200]: Mellanox Technologies MT27710 Family [ConnectX-4 Lx] [15b3:1015]
86:00.0 Ethernet controller [0200]: Mellanox Technologies MT27710 Family [ConnectX-4 Lx] [15b3:1015]
86:00.1 Ethernet controller [0200]: Mellanox Technologies MT27710 Family [ConnectX-4 Lx] [15b3:1015]

oc exec -it sriov-network-config-daemon-6nhh7 -n openshift-sriov-network-operator -c sriov-network-config-daemon /bin/bash

bash-4.4$ cat /proc/mounts | grep sysfs
sysfs /sys sysfs rw,seclabel,nosuid,nodev,noexec,relatime 0 0
sysfs /host/sys sysfs rw,seclabel,nosuid,nodev,noexec,relatime 0 0
sysfs /host/var/lib/containers/storage/overlay/fe940c1006177c7ad647ecba31380fc5c0798806b1de23997f2058768405dbae/merged/sys sysfs rw,seclabel,nosuid,nodev,noexec,relatime 0 0
bash-4.4$ mstconfig -e -d 0000:3b:00.1 q
-E- Failed to open the device
[core@master3 ~]$ uname -r
5.14.14-200.fc34.x86_64

[core@master3 ~]$ cat /etc/os-release
NAME=Fedora
VERSION="34 (CoreOS)"
ID=fedora
VERSION_ID=34
VERSION_CODENAME=""
PLATFORM_ID="platform:f34"
PRETTY_NAME="Fedora CoreOS 34"
ANSI_COLOR="0;38;2;60;110;180"
LOGO=fedora-logo-icon
CPE_NAME="cpe:/o:fedoraproject:fedora:34"
HOME_URL="https://getfedora.org/coreos/"
DOCUMENTATION_URL="https://docs.fedoraproject.org/en-US/fedora-coreos/"
SUPPORT_URL="https://github.com/coreos/fedora-coreos-tracker/"
BUG_REPORT_URL="https://github.com/coreos/fedora-coreos-tracker/"
REDHAT_BUGZILLA_PRODUCT="Fedora"
REDHAT_BUGZILLA_PRODUCT_VERSION=34
REDHAT_SUPPORT_PRODUCT="Fedora"
REDHAT_SUPPORT_PRODUCT_VERSION=34
PRIVACY_POLICY_URL="https://fedoraproject.org/wiki/Legal:PrivacyPolicy"
VARIANT="CoreOS"
VARIANT_ID=coreos
OSTREE_VERSION='49.34.202202100232-0'
DEFAULT_HOSTNAME=localhost

I found some information in the troubleshooting documentation:
“Tools PCI semaphore might be locked due to unexpected process shutdown.”
Could this be the cause?

“Run the following command:

mcra -c <mst_pci_device>

*Supported on MFT-4.4.0 and newer versions.”
We install mstflint but not the full MFT package, so I cannot find mcra in the config daemon container.
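
A quick way to confirm which of these tools are present in a given environment is to probe the PATH. The helper below is a small sketch; the name `missing_tools` is hypothetical and not part of mstflint or MFT.

```shell
# Print each named tool that is not found on PATH; prints nothing if all exist.
missing_tools() {
  local tool
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || echo "$tool"
  done
}

# Usage (inside the config daemon container): missing_tools mstconfig mstflint mcra
```

If mcra is listed, the semaphore-clearing step quoted above cannot be run from that container as-is.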

@AyaSenri

You might need a newer "mstconfig"/"mstflint" version. You should try to update or compile mstflint from source. Then you can try querying your device.

@AyaSenri what is the image of the sriov-network-config-daemon that you are using?

@wizhaoredhat @SchSeba

I found the cause. I run sriov-network-operator on OKD 4.9.

  1. The command mstconfig q works on the bare-metal host as root, after I installed mstflint with rpm-ostree.
  2. So I tried to run the command in a container, but it still failed. I suspected a security mechanism was causing the error.
  3. I found that SELinux is enforcing on OKD but disabled on Kubernetes. The command works once I switch SELinux to permissive mode:
setenforce 0
podman run --network=host --pid=host --user=root --volume=/:/host --privileged --rm -it quay.io/openshift/origin-sriov-network-config-daemon:4.9 /bin/bash
mstconfig -e -d 0000:3b:00.1 q
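
To avoid leaving the host permissive after such a test, the toggle can be wrapped so the previous mode is restored afterwards. This is a sketch for diagnosis only, assuming it runs as root on the host; `run_permissive` is a hypothetical helper name, while getenforce and setenforce are the standard SELinux utilities.

```shell
# Run a command with SELinux temporarily in permissive mode, then restore
# the previous mode. Must run as root on the host. Diagnostic use only.
run_permissive() {
  local previous status
  previous="$(getenforce)"
  # Only switch modes if the host is currently enforcing.
  if [ "$previous" = "Enforcing" ]; then
    setenforce 0 || return 1
  fi
  "$@"
  status=$?
  # Restore enforcing mode if that is what we started with.
  if [ "$previous" = "Enforcing" ]; then
    setenforce 1
  fi
  # Report the wrapped command's exit status to the caller.
  return "$status"
}

# Usage: run_permissive mstconfig -e -d 0000:3b:00.1 q
```

If the command succeeds only through this wrapper, that points at an SELinux denial rather than a firmware or tool-version problem.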

This is weird. SELinux should not block the mstconfig binary; we have coverage for this in our tests.

Can you please share the /var/log/audit/audit.log file from the host after you run the SR-IOV operator and it reports the error?
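
If sharing the whole file is impractical, the relevant records can be narrowed down first. The sketch below filters AVC denial records by the comm field; `avc_denials_for` is a hypothetical helper name, and it assumes the usual audit record layout in which denials contain "avc:", "denied" and comm="<command>".

```shell
# Print SELinux AVC denial records from an audit log that mention a given
# command name (the comm="..." field of the record).
avc_denials_for() {
  local comm="$1" logfile="${2:-/var/log/audit/audit.log}"
  grep 'avc:' "$logfile" | grep 'denied' | grep "comm=\"$comm\""
}

# Usage (as root on the host): avc_denials_for mstconfig
```

An empty result while the error still occurs would suggest the denial is logged under a different command name or is hidden by a dontaudit rule.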

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

@openshift-bot: Closing this issue.
