gardener/machine-controller-manager

Edge Case: Rolling Update scales all machine sets to `0`

rishabh-11 opened this issue · 4 comments

How to categorize this issue?

/area robustness
/kind bug
/priority critical

What happened:
During the live update, we observed that both the new and old machine sets were scaled down to 0. This happened because, during the rolling update, the number of machines in the new machine set reached the allowedLimit, so no more machines could be added to it (code link). If we look at the code here, the nameToSize map is not populated for the corresponding machine set, and here we call scale on the machine set, passing 0 as the scale value.
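
To illustrate the failure mode, here is a minimal sketch (not MCM code; only the map name `nameToSize` is taken from the description above, everything else is hypothetical): reading a missing key from a Go map silently returns the zero value, so a machine set that was skipped while computing target sizes ends up being scaled to 0.

```go
package main

import "fmt"

func main() {
	// Hypothetical stand-in for the map described above, keyed by
	// machine set name with the desired replica count as value.
	nameToSize := map[string]int32{}

	// Suppose the new machine set hit the allowedLimit, so the loop
	// that computes sizes never ran for it, i.e. this line is skipped:
	// nameToSize["machineset-new"] = 3

	// Reading a missing key does NOT fail; it yields the zero value.
	size := nameToSize["machineset-new"]
	fmt.Println(size) // prints 0 -> the machine set would be scaled to 0

	// A safer pattern: use the comma-ok idiom and skip the scale call
	// entirely when no target size was computed for this machine set.
	if size, ok := nameToSize["machineset-new"]; ok {
		fmt.Printf("scaling to %d\n", size)
	} else {
		fmt.Println("no target size computed; skipping scale")
	}
}
```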

What you expected to happen:

How to reproduce it (as minimally and precisely as possible):

Anything else we need to know?:
The PR which caused this regression - #765

Environment:

  • Kubernetes version (use kubectl version):
  • Cloud provider or hardware configuration:
  • Others:

/blocker

We have known about this catastrophic bug for 2 days?

Yes, work is ongoing; will raise a PR by noon.

What is fastest, a hotfix PR, or do you have a short-term mitigation strategy that Gardener operators can apply? We probably cannot change the max settings across all clusters and pools, and we cannot really disable reconciliation everywhere... is there anything an operator can do in the meantime to protect their clusters until the hotfix becomes available?

I mean, it’s a total meltdown, one of the worst events that can happen (besides losing ETCD, of course). MCM breaks absolutely everything in such a cluster. That’s pretty worrisome.

PR raised: #803, in the process of being merged and cherry-picked.
No, I can't think of any way to stop this in the meantime. I didn't get time to think about when it could occur from a landscape perspective.