gardener/machine-controller-manager

Graceful scaledown of deprecated MachineDeployment

mattburgess opened this issue

How to categorize this issue?

/area usability
/kind enhancement
/priority 3

What would you like to be added:

We'd like a way to gracefully scale a MachineDeployment down to 0, specifically without assuming that PDBs will protect pod availability.

Why is this needed:

From time to time we need to completely remove a MachineDeployment from our clusters. Ideally we'd run something like `kubectl -n machine-controller-manager scale machinedeployment my-md --replicas 0` and just let MCM handle things. However, this can have undesirable consequences:

  1. If that MachineDeployment manages more than x% of the cluster's capacity, scaling it down can evict more pods than our buffer pods reserve capacity for, meaning outages while we wait for CA to notice the unschedulable pods and bring in more nodes.
  2. If a particular pod doesn't have a PDB, or its PDB is misconfigured (e.g. `maxUnavailable: 0`), then the nodes running those pods hit the 10-minute eviction timeout and are all terminated at the same time, leading to a loss of service. Unfortunately, we're not in control of those PDBs, and despite some efforts to ask for them to be adjusted we have no guarantee that they will be.
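
For reference, a PDB like the following is the problematic kind: with `maxUnavailable: 0` the eviction API rejects every voluntary eviction, so a drain can only end in the timeout (names here are illustrative):

```yaml
# A PDB that blocks all voluntary evictions. With maxUnavailable: 0 the
# eviction API refuses every eviction request, so a node drain stalls
# until the ~10 min eviction timeout and the machine is then terminated
# with the pods still running.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-app        # illustrative
  namespace: default
spec:
  maxUnavailable: 0
  selector:
    matchLabels:
      app: my-app
```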

In an ideal scenario I'd quite like the following workflow:

  1. I can mark the MachineDeployment in some way so that CA ignores it for any scaling decisions (is `cluster-autoscaler.kubernetes.io/scale-down-disabled: true` sufficient? Does that also tell it to not scale up either?)
  2. I can mark the MachineDeployment in some way so that MCM knows it needs to be gracefully drained
  3. MCM proceeds to scale the MachineDeployment down x (user-configurable) nodes at a time. It waits for that scale-down to complete and for the number of unschedulable pods to drop below y (user-configurable) before proceeding with the next iteration of the scale-down loop.

> I can mark the MachineDeployment in some way so that CA ignores it for any scaling decisions

You can remove the MachineDeployment from the `--nodes` flag of the autoscaler.
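
For context, node groups are passed to the Gardener autoscaler as MachineDeployment references on `--nodes`, roughly in this shape (min/max and names are illustrative, assuming the `<min>:<max>:<namespace>.<machinedeployment-name>` format used by the Gardener fork):

```
--nodes=1:10:shoot--my-project--my-cluster.my-md
```

Dropping the deprecated MachineDeployment's entry means CA no longer considers that node group at all.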

> is `cluster-autoscaler.kubernetes.io/scale-down-disabled: true` sufficient? Does that also tell it to not scale up either?

It is a per-node annotation and doesn't tell the autoscaler anything about the node group, so scale-up of the node group would still happen.
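
For completeness, that annotation is applied to individual Node objects, e.g.:

```bash
# Per-node CA annotation: protects this node from scale-down, but says
# nothing about its node group, so CA can still scale the group up.
kubectl annotate node my-node-1 cluster-autoscaler.kubernetes.io/scale-down-disabled=true
```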

> It waits for that scale-down to complete and for the number of unschedulable pods to drop below y (user-configurable) before proceeding with the next iteration of the scale-down loop

We have written MCM to work with CA. CA deals with unschedulable pods and directs MCM to scale a particular node group up or down. MCM only deals with machines in terms of their count, so making MCM that smart would just complicate things.
(Plus, it would be quite complicated in situations where more unschedulable pods keep coming in and stop the scale-down, so it cannot be generalized.)

> MCM proceeds to scale the MachineDeployment down x (user-configurable) nodes at a time

This can also be done using a script where you issue the command

```
kubectl -n machine-controller-manager scale machinedeployment my-md --replicas <replicas>
```

while keeping a note of the available machines in the deployment. Too much configurability from our side is not required.
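
A minimal sketch of such a script (step size, threshold, and names are placeholders; it assumes `status.availableReplicas` on the MachineDeployment and treats Pending pods as a proxy for unschedulable ones):

```bash
#!/usr/bin/env bash
# Gradually scale a MachineDeployment to 0, STEP machines at a time,
# pausing while too many pods are still Pending.
set -euo pipefail

MD="my-md"                         # MachineDeployment to retire (placeholder)
NS="machine-controller-manager"
STEP=2                             # x: machines removed per iteration
MAX_PENDING=5                      # y: tolerated unschedulable pods

replicas=$(kubectl -n "$NS" get machinedeployment "$MD" -o jsonpath='{.spec.replicas}')
while [ "$replicas" -gt 0 ]; do
  replicas=$(( replicas > STEP ? replicas - STEP : 0 ))
  kubectl -n "$NS" scale machinedeployment "$MD" --replicas "$replicas"

  # Wait for the scale-down to settle (available <= desired) ...
  while true; do
    avail=$(kubectl -n "$NS" get machinedeployment "$MD" -o jsonpath='{.status.availableReplicas}')
    [ "${avail:-0}" -le "$replicas" ] && break
    sleep 30
  done

  # ... and for the number of Pending pods to drop below the threshold.
  while [ "$(kubectl get pods -A --field-selector=status.phase=Pending -o name | wc -l)" -gt "$MAX_PENDING" ]; do
    sleep 30
  done
done
```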

> is `cluster-autoscaler.kubernetes.io/scale-down-disabled: true` sufficient? Does that also tell it to not scale up either?

You can achieve this by adding a taint, which your pods don't tolerate, to all nodes of the MachineDeployment. To do so, add the taint in the `spec.template.spec.nodeTemplate.taints` section.
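
For example (the taint key is hypothetical; the path mirrors the one above):

```yaml
# Illustrative MachineDeployment snippet: nodes of this deployment carry
# a taint that workload pods don't tolerate, so the scheduler stops
# placing pods on them even if CA scales the group up.
spec:
  template:
    spec:
      nodeTemplate:
        taints:
        - key: node.example.com/deprecated   # hypothetical key
          value: "true"
          effect: NoSchedule
```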

> MCM only deals with machines in terms of their count, so making MCM that smart would just complicate things.

Yeah, understood. Thanks for the detailed response. We were trying to avoid having to write our own scale-down utility but it looks like that might be unavoidable.

> You can remove the MachineDeployment from the `--nodes` flag of the autoscaler.

It's a shame that node group auto-discovery hasn't been plugged in yet, as doing this obviously requires a code change + redeployment on our side. We may look at contributing auto-discovery if it isn't already being looked at?

It'd be nice, then, if CA supported such node-group deprecation. That way our migration of MDs/node-groups would look like this:

  1. Create new MachineDeployment (with CA node-group auto-discovery wired up we'd not need to make any changes to CA config)
  2. Add a label to the old MachineDeployment to signal to CA that it should no longer consider the related node-group for scale-up, and to gracefully scale the node-group to 0
  3. Delete the old MachineDeployment
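
Step 2 might look something like this (the label key is purely hypothetical; CA has no such deprecation support today):

```bash
# Hypothetical deprecation signal -- NOT recognized by CA today; it only
# illustrates the requested workflow.
kubectl -n machine-controller-manager label machinedeployment old-md \
  autoscaler.example.com/deprecated=true
```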

Do you think that's a reasonable request that might be considered on the CA side? Either way, I'm happy for this to be closed, and we'll deal with the scale-down on our side for now.

> It's a shame that node group auto-discovery hasn't been plugged in yet, as doing this obviously requires a code change + redeployment on our side. We may look at contributing auto-discovery if it isn't already being looked at?

We also wanted to implement it, but because of low demand and our hands being full, we iceboxed the issue: gardener/autoscaler#29.

Your contributions are welcome. Please comment on the issue about how you want to implement it, and then we can discuss there.
Kindly close this issue if there are no further queries.

Thanks again for the feedback @himanshu-kun.