gardener/machine-controller-manager

Scale down of machineset during rolling update

Closed this issue · 2 comments

How to categorize this issue?

/area auto-scaling
/kind bug
/priority 2

What happened:

During a rolling update of our MachineDeployments we saw a MachineSet scale down to 0 very rapidly. This is very similar to #802, but it happened in a slightly more gradual manner (log lines are listed newest first):

I0606 10:36:39.331659 1 machineset.go:419] Too many replicas for machine-controller-manager/cluster-autoscaler-c5-4xlarge-az-a-776bd, need 0, deleting 3
I0606 10:36:26.699210 1 machineset.go:419] Too many replicas for machine-controller-manager/cluster-autoscaler-c5-4xlarge-az-a-776bd, need 3, deleting 1
I0606 10:36:20.485488 1 machineset.go:419] Too many replicas for machine-controller-manager/cluster-autoscaler-c5-4xlarge-az-a-776bd, need 4, deleting 7
I0606 10:36:09.951122 1 machineset.go:419] Too many replicas for machine-controller-manager/cluster-autoscaler-c5-4xlarge-az-a-776bd, need 11, deleting 1
I0606 10:36:03.741863 1 machineset.go:419] Too many replicas for machine-controller-manager/cluster-autoscaler-c5-4xlarge-az-a-776bd, need 12, deleting 9
I0606 10:35:44.584218 1 machineset.go:419] Too many replicas for machine-controller-manager/cluster-autoscaler-c5-4xlarge-az-a-776bd, need 21, deleting 3
I0606 10:35:39.340453 1 machineset.go:419] Too many replicas for machine-controller-manager/cluster-autoscaler-c5-4xlarge-az-a-776bd, need 24, deleting 5
I0606 10:35:32.284172 1 machineset.go:419] Too many replicas for machine-controller-manager/cluster-autoscaler-c5-4xlarge-az-a-776bd, need 29, deleting 1
I0606 10:35:27.788659 1 machineset.go:419] Too many replicas for machine-controller-manager/cluster-autoscaler-c5-4xlarge-az-a-776bd, need 30, deleting 1
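
To make the size of the drop easier to see, here is a minimal sketch that tallies the log lines above. It assumes the `need N, deleting M` format shown in the report and that `deleting` counts the machines removed in that pass. The helper names (`parse_events`, `total_deleted`, `is_contiguous`) are made up for this illustration and are not part of MCM.

```python
import re

# Log lines copied verbatim from the report above (newest first).
LOG = """\
I0606 10:36:39.331659 1 machineset.go:419] Too many replicas for machine-controller-manager/cluster-autoscaler-c5-4xlarge-az-a-776bd, need 0, deleting 3
I0606 10:36:26.699210 1 machineset.go:419] Too many replicas for machine-controller-manager/cluster-autoscaler-c5-4xlarge-az-a-776bd, need 3, deleting 1
I0606 10:36:20.485488 1 machineset.go:419] Too many replicas for machine-controller-manager/cluster-autoscaler-c5-4xlarge-az-a-776bd, need 4, deleting 7
I0606 10:36:09.951122 1 machineset.go:419] Too many replicas for machine-controller-manager/cluster-autoscaler-c5-4xlarge-az-a-776bd, need 11, deleting 1
I0606 10:36:03.741863 1 machineset.go:419] Too many replicas for machine-controller-manager/cluster-autoscaler-c5-4xlarge-az-a-776bd, need 12, deleting 9
I0606 10:35:44.584218 1 machineset.go:419] Too many replicas for machine-controller-manager/cluster-autoscaler-c5-4xlarge-az-a-776bd, need 21, deleting 3
I0606 10:35:39.340453 1 machineset.go:419] Too many replicas for machine-controller-manager/cluster-autoscaler-c5-4xlarge-az-a-776bd, need 24, deleting 5
I0606 10:35:32.284172 1 machineset.go:419] Too many replicas for machine-controller-manager/cluster-autoscaler-c5-4xlarge-az-a-776bd, need 29, deleting 1
I0606 10:35:27.788659 1 machineset.go:419] Too many replicas for machine-controller-manager/cluster-autoscaler-c5-4xlarge-az-a-776bd, need 30, deleting 1
"""

# Assumed message format: "... need <desired>, deleting <count>".
ENTRY = re.compile(r"need (\d+), deleting (\d+)")


def parse_events(log_text):
    """Return (need, deleting) pairs in chronological order (oldest first)."""
    events = [(int(m.group(1)), int(m.group(2))) for m in ENTRY.finditer(log_text)]
    events.reverse()  # the report lists the newest line first
    return events


def total_deleted(events):
    """Sum the machines deleted across all passes."""
    return sum(deleting for _, deleting in events)


def is_contiguous(events):
    """True if each pass starts from the replica count the previous pass asked for."""
    return all(
        prev_need == need + deleting
        for (prev_need, _), (need, deleting) in zip(events, events[1:])
    )


if __name__ == "__main__":
    events = parse_events(LOG)
    print(total_deleted(events), is_contiguous(events))  # prints "31 True"
```

Under those assumptions, the nine passes remove 31 machines in total, and each pass picks up exactly where the previous one left off, which matches a desired replica count being lowered step by step rather than a single reset to 0.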

What you expected to happen:

Cluster capacity should be maintained during a rolling update.

How to reproduce it (as minimally and precisely as possible):

Simply triggering a MachineDeployment rollout was all it took to cause this. We can grab more contextual logs to help with a reproducer if need be.

Anything else we need to know?:

I'm not sure whether this is a CA bug rather than an MCM one, as I'm led to believe that there have been (and still are) a number of bugs whereby MCM and CA can step on each other's toes during a rolling update. We're currently unable to upgrade CA because we first have to migrate away from AWSMachineClasses to MachineClasses. That migration will itself cause us to run into this issue, which is a little frustrating.

Environment:

  • Kubernetes version (use kubectl version): 1.22.17
  • Cloud provider or hardware configuration: AWS
  • Others: MCM-0.48.2, CA-0.18.0

After some more investigation, it definitely looks like we're being hit by gardener/autoscaler#118 (and, as a consequence, gardener/autoscaler#181). In our particular scenario, during a MachineDeployment rollingUpdate, AWS is unable to provision instances due to capacity issues in eu-central-1a. After 10 minutes those machines are detected as unregistered by CA, which then hits those bugs. Closing on this side, and we'll eagerly watch those CA bugs for updates.

History

There was an issue where the CA-MCM combination could not correctly remove an unregistered machine in certain corner cases (a shortcoming of our CA-MCM interaction for targeted removal of a machine). If two machineSets are present for a machineDeployment (as is the case during a rolling update), and CA reduces the replicas of the machineDeployment to remove a particular machine, then MCM could scale down either machineSet. This was dangerous, because it sometimes removed nodes that had already been rolled to the latest version.

Steps taken

We tried to deal with this in the best way possible by making changes on two levels:

  • CA must NOT direct any kind of scale-down or machine removal during a rolling update (gardener/autoscaler#160)
  • MCM removes machines only from old machineSets on scale-down, and adds machines only to the new machineSet on scale-up (#765)
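
The MCM-side rule in the second bullet can be sketched roughly as follows. This is only an illustration of the policy as described here, not MCM's actual implementation: the `MachineSet` dataclass, the `is_new` flag, and `apply_replica_change` are hypothetical names, and leaving any leftover scale-down unapplied once the old machineSets are empty is an assumption of this sketch.

```python
from dataclasses import dataclass


@dataclass
class MachineSet:
    """Hypothetical stand-in for a machineSet owned by a machineDeployment."""
    name: str
    replicas: int
    is_new: bool  # True for the machineSet created by the current rollout


def apply_replica_change(machine_sets, delta):
    """Apply a replica delta under the 'old sets shrink, new set grows' policy.

    Returns (replicas_by_name, unapplied), where unapplied is the part of a
    scale-down that could not be taken from old machineSets (assumption:
    this sketch never touches the new machineSet on scale-down).
    """
    replicas = {ms.name: ms.replicas for ms in machine_sets}

    if delta >= 0:
        # Scale-up: only the new machineSet receives extra replicas.
        new_sets = [ms for ms in machine_sets if ms.is_new]
        if not new_sets:
            raise ValueError("no new machineSet to scale up")
        replicas[new_sets[0].name] += delta
        return replicas, 0

    # Scale-down: take replicas from old machineSets only.
    remaining = -delta
    for ms in machine_sets:
        if ms.is_new or remaining == 0:
            continue
        taken = min(remaining, replicas[ms.name])
        replicas[ms.name] -= taken
        remaining -= taken
    return replicas, remaining


if __name__ == "__main__":
    sets = [MachineSet("old", 5, is_new=False), MachineSet("new", 3, is_new=True)]
    print(apply_replica_change(sets, -2))  # prints "({'old': 3, 'new': 3}, 0)"
    print(apply_replica_change(sets, 2))   # prints "({'old': 5, 'new': 5}, 0)"
```

With a policy like this, a CA that keeps lowering the machineDeployment's replicas during a rolling update drains the old machineSet step by step while the new one is left alone.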

In your case, you are using a CA-MCM combination where the MCM change is present but the CA change is absent, so CA is scaling down during the rolling update, and MCM is removing machines only from the old machineSet. So it's not a recurrence of gardener/autoscaler#118.

We actively support the latest 3 CA versions, so kindly update to one of them, and your problem should be resolved.