coreos/container-linux-update-operator

Select which nodes are rebootable

jamiehannaford opened this issue · 2 comments

We're using this operator to coordinate reboots across clusters, but we're seeing race conditions where a node is rebooted before essential pods are up. The reboot can sometimes happen immediately after the operator pod starts running.

One easy solution is to delay the operator pod deployment until the very last step, but it'd be nicer to have more granular control over which nodes are rebootable. Currently it seems that the agent DS is deployed to all nodes, which is a bit of an assumption. Can we use CRDs to define granularity?

You mention three different topics, but the one in the title, specifying which nodes update-agent should run on, is discussed in #76.

Can you try using the updated examples? You can now deploy the update-operator Deployment and update-agent DaemonSet manifests directly, and add a node selector to the DaemonSet for fine-grained control over where it is scheduled.

The old behavior, where update-operator creates the DaemonSet on your behalf, is being deprecated. In Tectonic clusters, this migration will be done with the new update-operator --auto-label-container-linux compatibility flag, which will apply the conventional label container-linux-update.v1.coreos.com/agent=true to Container Linux nodes.
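As a rough sketch of the node selector approach (not copied from the example manifests, so treat the details as assumptions), the scheduling part of the update-agent DaemonSet could look something like this. The metadata name is illustrative, and apiVersion, the pod selector, containers, and other required fields are left out on purpose; take those from the examples in the repo.

```yaml
# Fragment only: shows where the node selector goes in a DaemonSet manifest.
# apiVersion, selector, containers, etc. are omitted; use the repo's examples for those.
kind: DaemonSet
metadata:
  name: update-agent  # assumed name, for illustration
spec:
  template:
    spec:
      # Schedule the agent only on nodes carrying the conventional label.
      nodeSelector:
        container-linux-update.v1.coreos.com/agent: "true"
```

With a selector like this, a node only gets the agent once it carries the label, for example after running kubectl label node <node-name> container-linux-update.v1.coreos.com/agent=true, so you could hold off labeling a node until its essential pods are up.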

Please re-open if this doesn't solve your issue.