Cannot config Mellanox ConnectX-4 Lx on CoreOS
AyaSenri opened this issue · 9 comments
Description of problem:
When a SriovNetworkNodePolicy is created for a ConnectX-4 Lx, the VFs cannot be initialized.
It is similar to #43.
Steps to Reproduce:
- Install the SR-IOV operator with 'make deploy-setup'.
- Label node master3 with sriov3=true, then create a SriovNetworkNodePolicy for the ConnectX-4 Lx, like this:
oc label node master3 sriov3=true
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: policy-1
  namespace: openshift-sriov-network-operator
spec:
  deviceType: vfio-pci
  mtu: 1500
  nicSelector:
    rootDevices:
    - 0000:3b:00.1
  nodeSelector:
    sriov3: "true"
  numVfs: 2
  priority: 90
  resourceName: mlxnics
- Check the SriovNetworkNodeState objects and the sriov-network-config-daemon logs.
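The reproduction steps above can be sketched as one shell helper. This is only a rough outline, not something taken from the operator docs: the function name apply_policy_and_check is hypothetical, the policy is assumed to be saved in a local file, the node state object is assumed to be named after the node, and the pod label app=sriov-network-config-daemon is an assumption about how the daemon pods are labelled.

```shell
#!/usr/bin/env bash
# Hypothetical helper: label the node, apply the policy, then show the sync
# status and the most recent config daemon logs.
apply_policy_and_check() {
  local node="$1" policy_file="$2"
  local ns="openshift-sriov-network-operator"

  # The label must match the nodeSelector of the policy (sriov3: "true").
  oc label node "$node" sriov3=true --overwrite
  oc apply -f "$policy_file"

  # Assumption: the SriovNetworkNodeState object carries the node's name.
  oc get sriovnetworknodestates "$node" -n "$ns" -o jsonpath='{.status.syncStatus}'

  # Assumption: the daemon pods carry the label app=sriov-network-config-daemon.
  oc logs -n "$ns" -l app=sriov-network-config-daemon \
    -c sriov-network-config-daemon --tail=50
}
```

It would be called as apply_policy_and_check master3 policy-1.yaml from a shell with a logged-in oc client.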
Actual results:
oc logs sriov-network-config-daemon-6nhh7 -n openshift-sriov-network-operator -c sriov-network-config-daemon
I0907 09:51:27.061528 2912728 mellanox_plugin.go:227] mellanox-plugin getMlnxNicFwData(): device 0000:3b:00.1
I0907 09:51:27.061531 2912728 mellanox_plugin.go:220] mellanox-plugin mstConfigReadData(): device 0000:3b:00.1
I0907 09:51:27.061533 2912728 utils.go:740] RunCommand(): mstconfig [-e -d 0000:3b:00.1 q]
I0907 09:51:27.061555 2912728 utils.go:747] RunCommand(): mstconfig, [-e -d 0000:3b:00.1 q]
I0907 09:51:27.066132 2912728 utils.go:749] RunCommand(): -E- Failed to open the device
, exit status 3
E0907 09:51:27.066157 2912728 mellanox_plugin.go:233] mellanox-plugin getMlnxNicFwData(): failed exit status 3
E0907 09:51:27.066164 2912728 daemon.go:462] nodeStateSyncHandler(): plugin mellanox_plugin error: exit status 3
[core@master3 ~]$ lspci -nn | grep Mellanox
3b:00.0 Ethernet controller [0200]: Mellanox Technologies MT27710 Family [ConnectX-4 Lx] [15b3:1015]
3b:00.1 Ethernet controller [0200]: Mellanox Technologies MT27710 Family [ConnectX-4 Lx] [15b3:1015]
86:00.0 Ethernet controller [0200]: Mellanox Technologies MT27710 Family [ConnectX-4 Lx] [15b3:1015]
86:00.1 Ethernet controller [0200]: Mellanox Technologies MT27710 Family [ConnectX-4 Lx] [15b3:1015]
oc exec -it sriov-network-config-daemon-6nhh7 -n openshift-sriov-network-operator -c sriov-network-config-daemon /bin/bash
bash-4.4$ cat /proc/mounts | grep sysfs
sysfs /sys sysfs rw,seclabel,nosuid,nodev,noexec,relatime 0 0
sysfs /host/sys sysfs rw,seclabel,nosuid,nodev,noexec,relatime 0 0
sysfs /host/var/lib/containers/storage/overlay/fe940c1006177c7ad647ecba31380fc5c0798806b1de23997f2058768405dbae/merged/sys sysfs rw,seclabel,nosuid,nodev,noexec,relatime 0 0
bash-4.4$ mstconfig -e -d 0000:3b:00.1 q
-E- Failed to open the device
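Since mstconfig fails here with only "Failed to open the device", one way to narrow it down is to check whether the device's sysfs files are reachable from inside the container. The sketch below assumes mstconfig accesses the device through the config and resource0 files under /sys/bus/pci/devices/<address>; that is an assumption about mstflint's access path, and the helper name check_pci_sysfs_access is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: report whether the sysfs files of a PCI device are
# readable by the current process. The second argument lets the sysfs root
# be overridden, e.g. /host/sys inside the config daemon container.
check_pci_sysfs_access() {
  local addr="$1" sysfs_root="${2:-/sys}"
  local dev_dir="${sysfs_root}/bus/pci/devices/${addr}"
  local status=0 f

  if [ ! -d "$dev_dir" ]; then
    echo "missing: ${dev_dir}"
    return 1
  fi

  for f in config resource0; do
    if [ -r "${dev_dir}/${f}" ]; then
      echo "readable: ${f}"
    else
      echo "not readable: ${f}"
      status=1
    fi
  done
  return "$status"
}
```

Running check_pci_sysfs_access 0000:3b:00.1 on the host and again inside the container, together with getenforce, shows whether the two environments differ. A readable file does not prove that mstconfig can open it, so this is only a first filter.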
[core@master3 ~]$ uname -r
5.14.14-200.fc34.x86_64
[core@master3 ~]$ cat /etc/os-release
NAME=Fedora
VERSION="34 (CoreOS)"
ID=fedora
VERSION_ID=34
VERSION_CODENAME=""
PLATFORM_ID="platform:f34"
PRETTY_NAME="Fedora CoreOS 34"
ANSI_COLOR="0;38;2;60;110;180"
LOGO=fedora-logo-icon
CPE_NAME="cpe:/o:fedoraproject:fedora:34"
HOME_URL="https://getfedora.org/coreos/"
DOCUMENTATION_URL="https://docs.fedoraproject.org/en-US/fedora-coreos/"
SUPPORT_URL="https://github.com/coreos/fedora-coreos-tracker/"
BUG_REPORT_URL="https://github.com/coreos/fedora-coreos-tracker/"
REDHAT_BUGZILLA_PRODUCT="Fedora"
REDHAT_BUGZILLA_PRODUCT_VERSION=34
REDHAT_SUPPORT_PRODUCT="Fedora"
REDHAT_SUPPORT_PRODUCT_VERSION=34
PRIVACY_POLICY_URL="https://fedoraproject.org/wiki/Legal:PrivacyPolicy"
VARIANT="CoreOS"
VARIANT_ID=coreos
OSTREE_VERSION='49.34.202202100232-0'
DEFAULT_HOSTNAME=localhost
I found some information in the Troubleshooting guide:
“Tools PCI semaphore might be locked due to unexpected process shutdown.”
Could this be the cause?
“Run the following command:
mcra -c <mst_pci_device>
*Supported on MFT-4.4.0 and newer versions.”
Only mstflint is installed in the config daemon container, not the full MFT package, so I cannot find mcra there.
You might need a newer "mstconfig"/"mstflint" version. You should try to update or compile mstflint from source. Then you can try querying your device.
@AyaSenri what is the image of the sriov-network-config-daemon that you are using?
I found the cause. I run sriov-network-operator on OKD 4.9.
- The command
mstconfig q
works on the bare-metal host as root, after I installed mstflint with rpm-ostree.
- So I tried to run the command in a container, but it still failed. I think a security mechanism causes this error.
- I found that SELinux is enforcing on OKD but disabled on Kubernetes. It works when I turn off SELinux enforcement:
setenforce 0
podman run --network=host --pid=host --user=root --volume=/:/host --privileged --rm -it quay.io/openshift/origin-sriov-network-config-daemon:4.9 /bin/bash
mstconfig -e -d 0000:3b:00.1 q
This is weird. SELinux should not block the mstconfig binary; we have coverage for this in our tests.
Can you please share the /var/log/audit/audit.log file from the host after you run the SR-IOV operator and it reports the error?
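To cut the audit log down to the relevant entries before sharing it, something like the following could be used. It is a minimal sketch: avc_denials_for is a hypothetical helper name, and it assumes the denials are logged as type=AVC records that carry a comm="mstconfig" field.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print SELinux AVC denial records for one command name.
# The log path defaults to the usual auditd location and can be overridden.
avc_denials_for() {
  local comm="$1" log="${2:-/var/log/audit/audit.log}"
  grep 'type=AVC' "$log" | grep 'denied' | grep "comm=\"${comm}\""
}
```

On the host this would be run as root, for example avc_denials_for mstconfig. Where the audit tools are installed, ausearch -m avc -c mstconfig should give a similar result. If SELinux is in permissive mode, as after setenforce 0, the same accesses are still logged but no longer blocked.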
Issues go stale after 90d of inactivity.
Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.
If this issue is safe to close now please do so with /close.
/lifecycle stale
Stale issues rot after 30d of inactivity.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.
If this issue is safe to close now please do so with /close.
/lifecycle rotten
/remove-lifecycle stale
Rotten issues close after 30d of inactivity.
Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.
/close
@openshift-bot: Closing this issue.