openshift/machine-config-operator

Deleting old rendered MachineConfigs during an update makes recovery super hard.

m-yosefpor opened this issue · 11 comments

Description

When a MachineConfig is changed, a new rendered MachineConfig is generated by the machine-config-controller, and nodes start to roll out from current = old rendered MachineConfig to desired = new rendered MachineConfig. However, if someone deletes the current (old) rendered MachineConfig while the MCP update is in progress, all nodes become degraded, the machine-config-server responds with HTTP 500 because the old rendered MachineConfig is not found, and the deleted object is never regenerated by the machine-config-controller, which only produces new rendered configs. As a result the MCO cannot bring the cluster to the desired state, and a user who has no backup of the old rendered MachineConfig cannot make the cluster healthy again.

Even recreating it is not an option, since the machine-config-server is not responding.
Creating a MachineConfig with the expected name of the old rendered config also does not help, because the current state will not match the state on disk.
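
For reference, the current/desired mapping lives in node annotations maintained by the machine-config-daemon, so a rough way to see the mismatch (the node name below is a placeholder) is something like:

# Show the rendered config a node is on (currentConfig), the one it should move to
# (desiredConfig), and the MCD state, which ends up Degraded in the scenario above.
oc describe node <worker-node> | grep machineconfiguration.openshift.io/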

Steps to reproduce the issue:

  1. Change a MachineConfig so that a new rendered MachineConfig is generated: oc edit mc <sth>
  2. Delete the old rendered MachineConfig: oc delete mc <old-rendered-mc> (a concrete sketch follows these steps)
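
A rough, concrete version of those two steps, with placeholder names (do not try this on a cluster you care about):

# 1. Edit any MachineConfig targeting the pool so the controller renders a new config.
oc edit mc <sth>

# 2. While the pool is still rolling out, find and delete the previous rendered config.
oc get mc | grep rendered-worker
oc delete mc <old-rendered-mc>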

Describe the results you received:

Describe the results you expected:

The MCO should be able to put the cluster into the desired state, and be more resilient to such changes to MachineConfig resources.

Additional information you deem important (e.g. issue happens only occasionally):

Output of oc adm release info --commits | grep machine-config-operator:

(paste your output here)

Additional environment details (platform, options, etc.):

What is the reason that you are deleting the current rendered config during an update? This is not supported behavior and is dangerous: while the pool is in an intermediate state, some nodes still need to be, and are, running on the "current" MC until it is their turn to be upgraded to the desired config. You should allow updates to proceed and finish without removing the current rendered config that the nodes are still using.
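
A minimal sketch of what "allow updates to proceed and finish" looks like in practice, assuming the standard pool conditions:

# Check the worker pool and block until it reports Updated before touching rendered configs.
oc get mcp worker
oc wait mcp/worker --for=condition=Updated --timeout=30m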

What I was trying to illustrate here (by deleting the old MCs during an MCP update) was only to show how fragile MCO operations are. OKD is already resilient to lots of changes to, or deletions of, applied objects (which can be caused by human error, or by a misconfigured component such as Argo CD or a third-party controller).

Deleting many of the cluster's Deployments/Services/Routes/CRs does not cause much trouble for OKD, because the related operators regenerate the required manifests and put the cluster back into the desired state. Even if the configuration of the responsible operator is broken, the CVO first heals the CO, the CO can then heal the other affected resources, and the cluster is eventually expected to be healthy again. The MCO, however, is fragile in this respect, and this hard dependency on old rendered MCs during a node ignition update (which cannot be regenerated by the CO) can put the cluster into an unrecoverable state.

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

/remove-lifecycle stale

This issue happened to us as well.
We wanted to delete an older MCP that was no longer required. Our engineer deleted the MC without waiting for the rollout of the new MCP to complete. Now the nodes are in NotReady state.

As mentioned earlier by Kirsten, deleting a rendered config is not recommended by the MCO; see https://docs.openshift.com/container-platform/4.8/post_installation_configuration/machine-configuration-tasks.html#checking-mco-status_post-install-machine-configuration-tasks. Perhaps we can try to improve our docs so that this is clearer. In the future we would like to improve the user experience around this once we have some capacity.
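
Until that lands, a hedged rule of thumb before cleaning up rendered configs is to confirm that no pool and no node still references them, for example:

# Rendered configs each pool references in spec and status:
oc get mcp -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.configuration.name}{"\t"}{.status.configuration.name}{"\n"}{end}'

# Rendered config each node currently reports:
oc get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.metadata.annotations.machineconfiguration\.openshift\.io/currentConfig}{"\n"}{end}'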

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

So, I have no idea why you would want to delete the rendered MachineConfig to begin with. But even if you did, the old MachineConfig is stored on each node at /etc/mcs-machine-config-content.json:

❯ ssh core@192.168.1.132 cat /etc/mcs-machine-config-content.json | jq .metadata.name
"rendered-worker-c450e0e298a9237016db6339c88e1b43"

You can just recreate it from that file in the worst-case scenario. But I'm not sure we should expect the MCO to account for manual user deletion of objects that are required to manage the cluster. I personally think we are much safer sticking with a failure scenario here: for the sake of cluster stability, if required objects are missing, the environment should probably be reviewed by the administrator.
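
For completeness, a rough recovery sketch, assuming that file really holds the complete rendered MachineConfig object (the node address is a placeholder, and some server-managed metadata fields may need to be stripped first):

# Pull the saved rendered config off an affected node and recreate the object.
ssh core@<node> cat /etc/mcs-machine-config-content.json | oc create -f -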

Maybe a better solution to this problem would be to document the existence of that file, along with a general MCO troubleshooting guide?

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

@openshift-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.