gardener/autoscaler

Consider failed machine as terminating

himanshu-kun opened this issue · 14 comments

What happened:
A race condition can occur between MCM and CA when a node does not join within 20 minutes (the default creationTimeout and maxNodeProvisionTime).
This can lead to MCM deleting the Failed machine and creating a new one, and CA then deleting the new machine, thinking it is deleting the Failed one.

What you expected to happen:
CA should not take action if the machine it wants to terminate is already terminating.

How to reproduce it (as minimally and precisely as possible):

  • Make CA scale up a node group
  • Ensure the newly created machine does not join within 20 minutes. This can be achieved by deleting the VM as soon as it is created or by tweaking the router rules; the important bit is that the node must not get registered.
  • Keep the default configuration for both MCM and CA

Anything else we need to know:
This happens because of the delay before the machineSet controller marks a Failed machine with a deletionTimestamp; until then the machine is not considered terminating.
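For illustration, a minimal sketch (not the actual provider code; the MCM v1alpha1 field names and phase constant are assumed here) of why a Failed machine slips through a "terminating" check during that window:

import (
	v1alpha1 "github.com/gardener/machine-controller-manager/pkg/apis/machine/v1alpha1"
)

// isTerminating is a hypothetical helper showing the kind of check involved.
func isTerminating(m *v1alpha1.Machine) bool {
	// A Failed machine has neither a deletionTimestamp nor the Terminating phase
	// until the MachineSet controller reacts to it, so during that window it is
	// not counted as terminating.
	return m.DeletionTimestamp != nil ||
		m.Status.CurrentStatus.Phase == v1alpha1.MachineTerminating
}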

Environment:

Hello, can we have an update on the status?

This would be available from release 1.22 of CA

Hello @himanshu-kun, do you know when 1.22 will be rolled out?

Hi @bbochev, I am planning to make a release next week, after the fix for this issue. It could then take one more week to reach the Canary landscape.

Hello @himanshu-kun,
Could you please update us here: was the new release of CA rolled out successfully on canary?

Hi @bbochev
Apologies, I was not able to work on it this week due to some other major fixes. I'll start working on it now. Then I'll make the release.

After looking more into the issue, we realized that it needs a bigger change: the state transitions of machines and the communication between CA and MCM have to be redesigned.
I see that in the live ticket https://github.tools.sap/kubernetes-live/issues-live/issues/1607 the question was more about why two scale-up attempts happened. The system came back to a normal state on its own within an hour.
To deal with this for now, the customer can keep maxNodeProvisionTime different (e.g. 19min) from machineCreationTimeout, which is 20min by default. The trick is not to keep them the same.
cc @unmarshall
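For illustration only, the workaround amounts to flag settings along the following lines (flag names as referenced elsewhere in this thread; the values are just examples that keep the two timeouts apart):

# cluster-autoscaler (Gardener fork)
--max-node-provision-time=19m

# machine-controller-manager (left at its default)
--machine-creation-timeout=20m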

@himanshu-kun You have mentioned internal references in the public. Please check.

Background

  • The cluster autoscaler, as part of its regular RunOnce loop, calls deleteCreatedNodesWithErrors, which in turn calls nodeGroup.DeleteNodes(nodesToBeDeleted).
  • The MCM implementation of NodeGroup.DeleteNodes(nodesToBeDeleted) obtains the machine objects associated with nodesToBeDeleted and then calls mcmManager.DeleteMachines(machines).
  • DeleteMachines(machines) iterates through each machine object, obtains its machine deployment and uses the following logic to set the replicas in the machine deployment:
// terminatingMachines only covers machines already marked for deletion;
// Failed machines are not counted here.
expectedReplicas := mdclone.Spec.Replicas - int32(len(machines)) + int32(len(terminatingMachines))
if expectedReplicas == mdclone.Spec.Replicas {
	klog.Infof("MachineDeployment %q is already set to %d, skipping the update", mdclone.Name, expectedReplicas)
	break
}
mdclone.Spec.Replicas = expectedReplicas
  • As can be seen above, we have a problem: the expectedReplicas is lowered since we never consider Failed machines in the above logic.

Grooming Decisions

  • For solving the problem specifically mentioned in this issue, we believe that the expectedReplicas should be computed as follows (see the sketch after this list):
expectedReplicas := mdclone.Spec.Replicas - int32(len(machines)) + int32(len(terminatingOrFailedMachines))
  • We understand and acknowledge that this proposed fix (which was also submitted as a PR earlier in this thread) only addresses one problem and not other edge cases where scaling is not optimal (for example, a machine in the Pending phase lowers the deployment replica count even though it could have become Running a minute later; the autoscaler would then try a new node group and wait for a node to join there). A separate issue, #181, has been raised for generic autoscaler<->MCM edge cases.
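A minimal sketch of the proposed computation, assuming the MCM v1alpha1 phase constants MachineTerminating and MachineFailed and a hypothetical machinesOfDeployment slice holding all machine objects of the deployment (illustrative, not the provider's actual code):

var terminatingOrFailedMachines []*v1alpha1.Machine
for _, m := range machinesOfDeployment {
	phase := m.Status.CurrentStatus.Phase
	// Count machines that are already on their way out (deletionTimestamp set or
	// Terminating phase) as well as Failed machines, so the replica count is not
	// lowered for machines MCM is about to remove anyway.
	if m.DeletionTimestamp != nil || phase == v1alpha1.MachineTerminating || phase == v1alpha1.MachineFailed {
		terminatingOrFailedMachines = append(terminatingOrFailedMachines, m)
	}
}
expectedReplicas := mdclone.Spec.Replicas - int32(len(machines)) + int32(len(terminatingOrFailedMachines))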

To deal with this for now, the customer can keep maxNodeProvisionTime different (e.g. 19min) from machineCreationTimeout, which is 20min by default.

We're not using Gardener, but are using MCM and hence Gardener's CA fork too. We have these settings:

  • CA - --max-node-provision-time=10m0s
  • MCM - --machine-drain-timeout=10m & --machine-health-timeout=10m (note, specifically, that we don't override MCM's --machine-creation-timeout)

In our specific case, we run into this bug when AWS can't provision nodes due to capacity issues. CA's unregistered-node handler then kicks in after 10 minutes and triggers this bug. We also only appear to hit it during a rolling update of a MachineDeployment; simply causing CA to scale up doesn't appear to hit it.

Is the right workaround here to adjust MCM's --machine-creation-timeout to be less than CA's --max-node-provision-time so that MCM can clean up the failed machine before CA's unregistered-node handler has a chance to kick in? Your guidance above has CA set to 19m and MCM to its default of 20m, but in our case we're on 10m and 20m respectively, so even though they're different we're still hitting this bug.

I don't know whether it helps at all, but my testing has shown that even the band-aid fix isn't helpful in our specific case.

When AWS is under capacity pressure (particularly prevalent in eu-central-1a), we get an error back from the AWS API. This causes the Machine to enter a CrashLoopBackOff state, which is why the patch doesn't help us. That also explains why our adjustment of --machine-creation-timeout doesn't appear to help avoid the race either; that flag only appears to help when the Machine can reach a Pending status, i.e. the cloud provider can provide a backing instance but something else (networking, bootstrapping, etc.) prevents the Machine from reaching a Running state.

In short, whatever approach is used to address this bug will need to cater for both of those scenarios.

in our case we're on 10m and 20m respectively, so even though they're different we're still hitting this bug.

The explanation still holds in your scenario above.

that flag only appears to help when the Machine can reach a Pending status

No, this flag also works for the CrashLoopBackOff phase, though there are some improvements we have to make there; that shouldn't be causing this issue. This issue is only caused by the race. Let me know if you feel otherwise.