aws/aws-node-termination-handler

Group and queue nodes for termination

stevehipwell opened this issue · 10 comments

Describe the feature
I'd like NTH to be able to group nodes (similar to the CA --balance-similar-node-groups) and support processing n nodes per group (this can still be constrained by the workers configuration).

I assume that v2 would be designed around this kind of concept, but I think it'd be worth doing in v1 assuming it wouldn't take too much effort.

Is the feature request related to a problem?
When using NTH to manage ASG instance refresh events it is very easy to get a cluster into a blocking race condition due to pods being terminated off different nodes causing no nodes to be able to fully shut down due to PDBs. This results in hard terminations and general cluster instability.
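For illustration only, a minimal sketch of the kind of PDB involved (all names are hypothetical), expressed with the Kubernetes Go API types; an eviction that would drop the workload below minAvailable is refused, so a node drain stalls until the pod can land somewhere else first:

```go
package main

import (
	"fmt"

	policyv1 "k8s.io/api/policy/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

// examplePDB requires at least 1 pod of the (hypothetical) workload to stay
// running at all times, which is what blocks the drains when every node
// hosting a replica is cordoned at once.
func examplePDB() *policyv1.PodDisruptionBudget {
	minAvailable := intstr.FromInt(1)
	return &policyv1.PodDisruptionBudget{
		ObjectMeta: metav1.ObjectMeta{Name: "example-stateful-workload"},
		Spec: policyv1.PodDisruptionBudgetSpec{
			MinAvailable: &minAvailable,
			Selector: &metav1.LabelSelector{
				MatchLabels: map[string]string{"app": "example-stateful-workload"},
			},
		},
	}
}

func main() {
	fmt.Printf("%+v\n", examplePDB())
}
```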

Describe alternatives you've considered
Using a single worker would work, but it would be too slow to respond to time-critical events, and even for instance refresh it could be too slow for good usability.

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. If you want this issue to never become stale, please ask a maintainer to apply the "stalebot-ignore" label.

@bwagner5 this is probably the biggest pain point with v1 and unmanaged node groups.

Has this functionality been considered for v2.0.0 or v2.x? It's still the biggest pain point with EKS that when we update our nodes the cluster often breaks.

@cjerad I've had a quick look at the v2 code (see #707 (comment)) and it looks to me like the only thing missing from the new terminator concept to solve this issue is the ability to specify a concurrency value per terminator. I'd be happy to offer this functionality as a PR, or as a more detailed spec to be implemented.

CC @bwagner5

I'd like NTH to be able to group nodes (similar to the CA --balance-similar-node-groups) and support processing n nodes per group (this can still be constrained by the workers configuration).

  1. Can you provide more information?
  2. What do you mean by "worker"?
  3. What do you mean by "processing n nodes per group"? Do you mean terminate only n nodes per group at a time?

When using NTH to manage ASG instance refresh events it is very easy to get a cluster into a blocking race condition due to pods being terminated off different nodes causing no nodes to be able to fully shut down due to PDBs.

So pods are not terminated because that would violate the PDB, but there are no available nodes on which to launch new pods. Do I have that right? If so, how would the ability to "group nodes" and then "processing n nodes per group" address the situation? NTH cannot prevent an EC2 instance from being terminated and it cannot launch new EC2 instances. But maybe I have misunderstood something in your proposal.

@cjerad my specific scenario is that we update the launch templates and each ASG's instance refresh triggers to replace the outdated nodes. This fires a lifecycle-hook event for each node being replaced and launches new instances. NTH receives the events for all of the nodes and can deal with them before allowing the lifecycle to complete. The problem is that when using ASGs for stateful workloads you need a separate ASG in each AZ, so in the above scenario at least 3 nodes are sent to NTH at once. If you have workloads spread across those 3 nodes with PDBs requiring at least 1 running pod, and all 3 nodes are actioned by NTH at the same time, you end up with a race condition: all nodes are cordoned, pods have been evicted from all of them but have nowhere to go, and all nodes are stuck terminating until the new compute is available (in v1, where Spot also needs handling, this results in a force termination).

The simple fix is to terminate the nodes with a concurrency of 1 while sending a heartbeat back for the lifecycle hook (or having a long enough timeout); this would allow a node to be cleanly drained and terminated before moving on to the next one. The simplest implementation would be to add a concurrency field to the terminator and use it for all termination types configured; this would be a great MVP solution. An improvement on top of the MVP would be to add a concurrencyLabelSelector field which, if provided, groups the nodes into buckets before applying the concurrency rules; this would allow a single terminator to handle instance refresh across all ASGs.
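To make the proposal concrete, here's a rough sketch of the bucketing logic I have in mind (the concurrency and concurrencyLabelSelector ideas are from the suggestion above; everything else, including the Node type, function name, and label key, is hypothetical and not the actual v2 API):

```go
package main

import "fmt"

// Node is a minimal stand-in for a Kubernetes node and its labels.
type Node struct {
	Name   string
	Labels map[string]string
}

// selectForDraining buckets candidate nodes by the value of groupLabel and
// returns at most maxPerGroup nodes from each bucket, after subtracting the
// drains already in flight for that bucket. Nodes without the label fall
// into a single shared bucket.
func selectForDraining(candidates []Node, groupLabel string, maxPerGroup int, inFlight map[string]int) []Node {
	var selected []Node
	perGroup := map[string]int{}
	for _, n := range candidates {
		group := n.Labels[groupLabel] // "" for unlabelled nodes
		if inFlight[group]+perGroup[group] >= maxPerGroup {
			continue // this group already has enough drains in progress
		}
		perGroup[group]++
		selected = append(selected, n)
	}
	return selected
}

func main() {
	// Three nodes from three AZ-specific ASGs that share one logical group label.
	nodes := []Node{
		{Name: "node-a-1", Labels: map[string]string{"node-group": "system"}},
		{Name: "node-b-1", Labels: map[string]string{"node-group": "system"}},
		{Name: "node-c-1", Labels: map[string]string{"node-group": "system"}},
	}
	// With a concurrency of 1 per group, only one node is selected for draining.
	fmt.Println(selectForDraining(nodes, "node-group", 1, map[string]int{}))
}
```

With the system nodes labelled as one logical group, only one of the three AZ-specific nodes would be drained at a time even though all three ASGs fired their lifecycle hooks together.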

The launch template update should trigger an instance refresh event. Would the ASG's minimum healthy percentage [1] setting be helpful here? If you've already tried that but still saw issues, what behavior did you see?

@cjerad there are multiple ASGs per logical "node group", which is best practice (and the only functional solution) for stateful workloads, so something in the architecture needs to treat them as a single group when updating so as not to trigger the race condition.

snay2 commented

@stevehipwell Tell me if I am understanding your situation correctly:

  1. You have a stateful workload that needs to run pods in multiple AZs, balanced in roughly equal ratio across each AZ
  2. Each AZ has its own ASG, and those ASGs conceptually comprise the group of nodes that will run this workload
  3. When those pods need to be rescheduled to new nodes, you wish to maintain the balance across AZs, so that no single AZ absorbs more than its target percentage of pods for this workload. Ideally, this means that a pod will be immediately rescheduled onto a node in its same AZ (and thus same ASG).
  4. Periodically, the launch template for these ASGs gets updated, which triggers instance refresh across the related ASGs roughly simultaneously, causing some percentage of compute capacity to be removed from each ASG
  5. NTH (one deployment for the entire cluster, running in queue processor mode) responds roughly immediately to the pending termination of one or more nodes in each of these ASGs, and it attempts to cordon and drain the affected nodes, irrespective of which AZ the node is in
  6. During the interval between when ASG begins terminating the old instances and when ASG can successfully bring up the new instances, there is insufficient capacity within a given AZ to handle the pods for this workload in that AZ
  7. Because the drain respects your PDBs, and because the cluster lacks sufficient capacity to absorb the pods that need to be rescheduled, the drain on all the affected nodes stalls
  8. Eventually the lifecycle hook times out, the nodes get terminated, and the pods were unable to gracefully reschedule to new nodes in time

Are there any inaccuracies or gaps in my description? If so, will you correct them?

How many nodes are in each of these ASGs? If each ASG only has one node, then this availability concern is to be expected, per the ASG documentation:

Instances terminated before launch: When there is only one instance in the Auto Scaling group, starting an instance refresh can result in an outage. This is because Amazon EC2 Auto Scaling terminates an instance and then launches a new instance.

If that's the case, then I could understand the need to address this availability problem at a higher level than the ASG, as you are proposing.

However, if each ASG has more than one node, you may be able to set a higher minimum healthy percentage (perhaps even as high as 100%) to cause the instance refresh to proceed at a slower pace within each ASG. If so, would that alleviate the capacity constraints that are causing this issue?

Setting the minimum healthy percentage to 100 percent limits the rate of replacement to one instance at a time.
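For reference, a minimal sketch of setting that preference when starting a refresh with the AWS SDK for Go (the ASG name and warmup value are placeholders):

```go
package main

import (
	"fmt"
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/autoscaling"
)

func main() {
	sess := session.Must(session.NewSession())
	client := autoscaling.New(sess)

	// Keep 100% of the group healthy, so instances are replaced one at a time.
	out, err := client.StartInstanceRefresh(&autoscaling.StartInstanceRefreshInput{
		AutoScalingGroupName: aws.String("my-node-group-az-a"), // placeholder name
		Preferences: &autoscaling.RefreshPreferences{
			MinHealthyPercentage: aws.Int64(100),
			InstanceWarmup:       aws.Int64(300), // seconds to wait before the next replacement
		},
	})
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("started instance refresh:", aws.StringValue(out.InstanceRefreshId))
}
```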

This seems like a problem better solved closer to the infrastructure level if possible, e.g., in the individual ASGs, so we are trying to understand what your situation is and what you want to accomplish. NTH intentionally does not know or care about ASGs or other conceptual node groupings at this time, and adding such awareness would be a large change that we'd need to consider carefully.

@snay2 I think you've understood the problem pretty well.

I think there are two significant parts of our architecture which make this issue show up more clearly than in some other use cases. Firstly, our architecture is built around the principle that clusters will have nodes well balanced across AZs, via an ASG per AZ per logical group, and pods are expected to be multi-replica and distributed across AZs (bring on EKS v1.24 so we get topology aware hints). Secondly, our clusters all run a set of "system" nodes, which is often a single node per AZ (same ASG-per-AZ architecture) and runs stateful workloads.

I agree that an ASG-native solution would be best, but scale to 0 for the ASGs behind MNGs hasn't been completed years after it was first requested, so I don't think this would be realistic (see aws/containers-roadmap#1865 & aws/containers-roadmap#1866 for related requests for MNGs). I think this is very much an NTH problem, especially in its v2 spec, as the solution wouldn't need to be ASG-based; it could be based on labels only (which in our case would be seeded by the ASGs).

This is a major issue for us at the moment and there isn't a solution, so we need a fix, and of the existing components we run on our EKS clusters I think NTH is the best fit. Alternatively we could write our own controller, but that seems like quite a lot of duplication.