openshift/machine-config-operator

Extensions enablement leaves the node in a state where `rpm-ostree install` doesn't work.

fidencio opened this issue · 6 comments

Description

After a package gets layered via the "Extensions" mechanism, manually installing another package with `rpm-ostree install` becomes impossible.

Steps to reproduce the issue:

  1. Deploy an operator that forces a package to be installed via the extension mechanism (I deployed the sandboxed-containers operator)
  2. Connect to a worker node via your preferred method
  3. Add a .repo file to /etc/yum.repos.d/
  4. Try to install a package from that repo
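The steps above can be sketched as follows. The repo id, `baseurl`, and package name are made-up placeholders, not the actual sandboxed-containers artifacts; `REPO_DIR` is parameterized only so the sketch is harmless off-node:

```shell
# On the worker node (reached e.g. via `oc debug node/<node-name>`),
# drop a repo definition into /etc/yum.repos.d/.
# [my-extra-repo], the baseurl, and `some-package` are hypothetical.
REPO_DIR="${REPO_DIR:-/etc/yum.repos.d}"
mkdir -p "$REPO_DIR"

cat > "$REPO_DIR/my-extra.repo" <<'EOF'
[my-extra-repo]
name=My extra repo
baseurl=https://example.com/repo/
enabled=1
gpgcheck=0
EOF

# The step that fails once an extension has already been enabled
# (guarded so the sketch does nothing on machines without rpm-ostree):
if command -v rpm-ostree >/dev/null 2>&1; then
    rpm-ostree install some-package
fi
```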

Describe the results you received:

You'll get a message like:

sh-4.4# rpm-ostree install $package
Checking out tree 6bc3ddd... done
Enabled rpm-md repositories: $package-repo
rpm-md repo '$package-repo' (cached); generated: 2021-11-23T05:13:43Z
Importing rpm-md... done
error: Packages not found: $package

Describe the results you expected:

Be able to rpm-ostree install $package without issues (assuming all its dependencies are provided ;-))

Additional information you deem important (e.g. issue happens only occasionally):

Happens 100% of the time.

Output of oc adm release info --commits | grep machine-config-operator:

$ oc adm release info --commits | grep machine-config-operator
  machine-config-operator                        https://github.com/openshift/machine-config-operator                        a1e3bbfc59d48997e888727fac9ac227bb32732

Additional environment details (platform, options, etc.):

The issue seems to happen because the repo used to install the extension's packages is simply removed after those packages are installed.
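A minimal way to confirm that state is to check whether any rpm-md repo definitions survive on the node. This is just a sketch using coreutils; the directory argument is parameterized purely so it can be exercised off-node:

```shell
# Report whether any .repo definitions exist under a yum.repos.d
# directory (defaults to the real location on the node):
check_repos() {
    local dir="${1:-/etc/yum.repos.d}"
    if ls "$dir"/*.repo >/dev/null 2>&1; then
        echo "repos present"
    else
        echo "no repos: rpm-ostree install has nothing to resolve from"
    fi
}

check_repos
```

On an affected node, running this after the extension reboot reports that no repos remain, even though `rpm-ostree status` still shows the extension's package as layered.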

I had a quick chat with @cgwalters on the #fedora-coreos channel, which led me to open the issue here; here's the transcript of the chat:

19:59 <  fidencio> so, first of all, I know this is a OCP + Extensions specific question, and I know this is FCOS channel, sorry about that.  Still, I'll give it a try and ask here in case someone is around and willing to chat (if not, please, don't kick me out and I'll just open an issue for OCP)
20:00 <  fidencio> seems that enabling an extension will lead RHCOS inside the node to be in a state where it can't install any other packages via `rpm-ostree install`, as the repo from where the extension package was installed gets removed after the installation happens
20:00 <  fidencio> I know, one should *not* be calling `rpm-ostree install` in a RHCOS node, I really know
20:01 <  fidencio> but breaking `rpm-ostree install`ability due to extensions enablement also seems .... unexpected.
20:02 <  fidencio> as walters is everywhere possible, I found https://github.com/coreos/rpm-ostree/issues/1602 which is slightly related, and his explanation makes sense in that context
20:02 <+  walters> I think you're hitting https://github.com/openshift/machine-config-operator/blob/c415ce6aed25604bc1d2478951db16759dac31f6/templates/common/_base/units/machine-config-daemon-firstboot.service.yaml#L17 ?
20:03 <  fidencio> walters: I wish, with that i'd at least have a workaround, but there's no repo under /etc/yum.repos.d
20:03 <+  walters> fidencio: this is only tangentially related but as of lately we've been working on https://github.com/coreos/enhancements/blob/main/os/coreos-layering.md and https://fedoraproject.org/wiki/Changes/OstreeNativeContainer  FYI
20:03 <  fidencio> walters: it seems that whatever repo gets enabled as extension, gets also removed after the reboot
20:04 <+  walters> oh right sorry yes...I understand, that is indeed right, the repo is removed
20:04 <  fidencio> walters: so, this is ... expected?
20:05 <+  walters> it is the status quo yeah, a bit messy to fix but possible
20:06 <  fidencio> walters: hmmm. I'm not sure if I am that comfortable with that status quo
20:07 <  fidencio> walters: maybe my use case is wrong from the beginning, to be fair, but I'm actually testing a lightweight VMM (cloud-hypervisor) with the sandboxed-containers stuff, which installs kata-containers on the worker nodes
20:10 <  fidencio> walters: do you think it deserves at least an issue open?
20:10 <+  walters> definitely
20:11 <  fidencio> walters: okay, i will open the issue against the MCO then, and CC you there
20:11 <  fidencio> walters: thanks a lot for the help, and for the pointers

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

/remove-lifecycle stale


Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

@openshift-bot: Closing this issue.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.