openshift/machine-config-operator

rpm-ostreed.service missing proxy configuration

zhouhao3 opened this issue · 1 comments

Description

We use OpenShift 4.12.0-ec.4 for an IPI deployment. The masters are deployed successfully, but crio.service on the masters fails to start, which causes the IPI deployment to fail. The reasons are as follows:

Our investigation found that crio.service fails because machine-config-daemon-firstboot.service, which it depends on, fails. The relevant information is as follows:

● machine-config-daemon-firstboot.service - Machine Config Daemon Firstboot
   Loaded: loaded (/etc/systemd/system/machine-config-daemon-firstboot.service; enabled; vendor preset: enabled)
   Active: activating (start) since Mon 2022-10-17 09:20:48 UTC; 17h ago
Main PID: 3674 (machine-config-)
    Tasks: 35 (limit: 406926)
   Memory: 47.1M
      CPU: 4min 20.173s
   CGroup: /system.slice/machine-config-daemon-firstboot.service
           └─3674 /run/bin/machine-config-daemon firstboot-complete-machineconfig
Oct 18 03:03:52 master-1 machine-config-daemon[3674]: I1018 03:03:52.904189    3674 rpm-ostree.go:447] Running captured: rpm-ostree --version
Oct 18 03:03:52 master-1 machine-config-daemon[3674]: I1018 03:03:52.929770    3674 rpm-ostree.go:407] Executing rebase to quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:0daf5c4a35424410e88dde102022fc3581302bc8a98e09e2e4748502c59b3661
Oct 18 03:03:52 master-1 machine-config-daemon[3674]: I1018 03:03:52.929786    3674 update.go:2053] Running: rpm-ostree rebase --experimental ostree-unverified-registry:quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:0daf5c4a35424410e88dde102022fc3581302bc8a98e09e2e4748502c59b3661
Oct 18 03:03:52 master-1 machine-config-daemon[61425]: Pulling manifest: ostree-unverified-image:docker://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:0daf5c4a35424410e88dde102022fc3581302bc8a98e09e2e4748502c59b3661
Oct 18 03:03:53 master-1 machine-config-daemon[3674]: I1018 03:03:53.074957    3674 update.go:1243] Updating files
Oct 18 03:03:53 master-1 machine-config-daemon[3674]: I1018 03:03:53.074977    3674 update.go:1308] Deleting stale data
Oct 18 03:03:53 master-1 machine-config-daemon[3674]: I1018 03:03:53.074985    3674 update.go:2098] Removing SIGTERM protection
Oct 18 03:03:53 master-1 machine-config-daemon[3674]: W1018 03:03:53.074994    3674 firstboot_complete_machineconfig.go:46] error: failed to update OS to quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:0daf5c4a35424410e88dde102022fc3581302bc8a98e09e2e4748502c59b3661 : error running rpm-ostree rebase --experimental ostree-unverified-registry:quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:0daf5c4a35424410e88dde102022fc3581302bc8a98e09e2e4748502c59b3661: error: remote error: pinging container registry quay.io: Get "https://quay.io/v2/": dial tcp: lookup quay.io on 192.168.30.1:53: no such host
Oct 18 03:03:53 master-1 machine-config-daemon[3674]: : exit status 1
Oct 18 03:03:53 master-1 machine-config-daemon[3674]: I1018 03:03:53.075000    3674 firstboot_complete_machineconfig.go:47] Sleeping 1 minute for retry

The error shows that rpm-ostree has no proxy configured when it runs the rebase command, so it tries to reach quay.io directly and the lookup fails.
After we manually configured the proxy for rpm-ostree, the rebase worked.
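A manual workaround of this kind could look like the sketch below. It assumes that rpm-ostreed picks up the standard HTTP_PROXY/HTTPS_PROXY/NO_PROXY variables from its service environment. The drop-in file name (10-proxy.conf), the helper name write_proxy_dropin, and the proxy URL in the usage comment are hypothetical and are not taken from this cluster.

```shell
#!/usr/bin/env bash
# Sketch: write a systemd drop-in that passes proxy variables to rpm-ostreed.
# The helper name, the drop-in file name, and the example proxy URL are hypothetical.
set -euo pipefail

write_proxy_dropin() {
  # $1: drop-in directory, $2: HTTP proxy URL, $3: HTTPS proxy URL, $4: NO_PROXY list
  local dropin_dir="$1" http_proxy_url="$2" https_proxy_url="$3" no_proxy_list="$4"
  mkdir -p "$dropin_dir"
  cat > "$dropin_dir/10-proxy.conf" <<EOF
[Service]
Environment="HTTP_PROXY=$http_proxy_url"
Environment="HTTPS_PROXY=$https_proxy_url"
Environment="NO_PROXY=$no_proxy_list"
EOF
}

# On the affected node, as root, something like:
#   write_proxy_dropin /etc/systemd/system/rpm-ostreed.service.d \
#       http://proxy.example.com:3128 http://proxy.example.com:3128 localhost,127.0.0.1
#   systemctl daemon-reload
#   systemctl restart rpm-ostreed
```

This only illustrates the manual step. A real fix would need the proxy settings to be rendered for rpm-ostreed.service before the firstboot service runs the rebase.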

In addition, we found that in a working version (4.11.1), machine-config-daemon-firstboot.service does not run the rpm-ostree rebase command; it pulls the OS content with oc image extract instead. The details are as follows:

● machine-config-daemon-firstboot.service - Machine Config Daemon Firstboot
   Loaded: loaded (/etc/systemd/system/machine-config-daemon-firstboot.service; enabled; vendor preset: enabled)
   Active: activating (start) since Tue 2022-10-18 07:21:02 UTC; 1min 25s ago
Main PID: 3825 (machine-config-)
    Tasks: 54 (limit: 406941)
   Memory: 514.7M
      CPU: 14.980s
   CGroup: /system.slice/machine-config-daemon-firstboot.service
           ├─3825 /run/bin/machine-config-daemon firstboot-complete-machineconfig
           └─3972 oc image extract --path /:/run/mco-machine-os-content/os-content-41001907 --registry-config /var/lib/kubelet/config.json quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:6e7c8e9e407ebab51eac2482d13c07d071c0be1a5755a36a64f0be1b73b3999a
Oct 18 07:21:02 master-0 machine-config-daemon[3825]: I1018 07:21:02.303454    3825 update.go:1976] Running: systemctl start rpm-ostreed
Oct 18 07:21:02 master-0 machine-config-daemon[3825]: I1018 07:21:02.459377    3825 rpm-ostree.go:324] Running captured: rpm-ostree status --json
Oct 18 07:21:02 master-0 machine-config-daemon[3825]: I1018 07:21:02.518682    3825 rpm-ostree.go:324] Running captured: rpm-ostree status --json
Oct 18 07:21:02 master-0 machine-config-daemon[3825]: I1018 07:21:02.562059    3825 daemon.go:236] Booted osImageURL:  (411.86.202207150124-0)
Oct 18 07:21:02 master-0 machine-config-daemon[3825]: I1018 07:21:02.563609    3825 update.go:2013] Adding SIGTERM protection
Oct 18 07:21:02 master-0 machine-config-daemon[3825]: I1018 07:21:02.564265    3825 update.go:513] Checking Reconcilable for config mco-empty-mc to rendered-master-e08a90a8cf8f7f4f823348adf310f481
Oct 18 07:21:02 master-0 machine-config-daemon[3825]: I1018 07:21:02.565941    3825 update.go:1991] Starting update from mco-empty-mc to rendered-master-e08a90a8cf8f7f4f823348adf310f481: &{osUpdate:true kargs:false fips:false passwd:false files:false units:false kernelType:false extensions:false}
Oct 18 07:21:02 master-0 machine-config-daemon[3825]: I1018 07:21:02.570660    3825 update.go:1207] Updating files
Oct 18 07:21:02 master-0 machine-config-daemon[3825]: I1018 07:21:02.570677    3825 update.go:1272] Deleting stale data
Oct 18 07:21:02 master-0 machine-config-daemon[3825]: I1018 07:21:02.570802    3825 run.go:19] Running: nice -- ionice -c 3 oc image extract --path /:/run/mco-machine-os-content/os-content-41001907 --registry-config /var/lib/kubelet/config.json quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:6e7c8e9e407ebab51eac2482d13c07d071c0be1a5755a36a64f0be1b73b3999a

Therefore, we think that machine-config-daemon-firstboot.service started running the rpm-ostree rebase command as of some version, but the corresponding proxy configuration was not added for it, which causes this problem.
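One way to check this hypothesis on a node is to compare the environment systemd passes to the two units. The sketch below assumes the proxy would appear as HTTP_PROXY or HTTPS_PROXY (in either case) in the unit's Environment property. The helper name has_proxy_env is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: detect whether a unit's Environment property carries a proxy variable.
# has_proxy_env is a hypothetical helper, not part of systemd or the MCO.

has_proxy_env() {
  # $1: text such as the output of `systemctl show <unit> --property=Environment`
  # Succeeds if the text contains HTTP_PROXY= or HTTPS_PROXY= (case-insensitive).
  printf '%s\n' "$1" | grep -Eiq '(^|[= ])https?_proxy='
}

# On the node, something like:
#   for unit in machine-config-daemon-firstboot.service rpm-ostreed.service; do
#     if has_proxy_env "$(systemctl show "$unit" --property=Environment)"; then
#       echo "$unit: proxy variables present"
#     else
#       echo "$unit: no proxy variables"
#     fi
#   done
```

If the firstboot unit reports proxy variables and rpm-ostreed.service does not, that would be consistent with the rebase being performed by a daemon that never received the proxy settings.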

Steps to reproduce the issue:

  1. openshift-baremetal-install --dir ~/clusterconfigs create manifests
  2. openshift-baremetal-install --dir ~/clusterconfigs --log-level debug create cluster

Describe the results you received:

DEBUG Log bundle written to /var/home/core/log-bundle-20221012071722.tar.gz
WARNING Unable to stat /home/kni/clusterconfigs/serial-log-bundle-20221012071722.tar.gz, skipping
ERROR Bootstrap failed to complete: timed out waiting for the condition
ERROR Failed to wait for bootstrapping to complete. This error usually happens when there is a problem with control plane hosts that prevents the control plane operators from creating the control plane.              
INFO Bootstrap gather logs captured here "/home/kni/clusterconfigs/log-bundle-20221012071722.tar.gz"

Describe the results you expected:

Additional information you deem important (e.g. issue happens only occasionally):

machine-config-operator image info:

quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:d1650adb41efbe7287997152c74850a410ba0a5eb2d3ab9c7723d144e7985de5

See issue 6482 for more details.

Output of oc adm release info --commits | grep machine-config-operator:

(paste your output here)

Additional environment details (platform, options, etc.):

Hi, PR in #3377