Node termination handler drains nodes in all AZs in parallel
kaurmultani opened this issue · 2 comments
Describe the feature
We use instance refresh to update our 3 Auto Scaling Groups (ASGs), one for each Availability Zone (AZ).
We run the node termination handler (SQS mode) for our EKS cluster, which has a single self-managed node group:
node_groups:
  - name: wg1
    class: self-managed
    instance_types: ["xx.large", "xx.large"]
    spot: true
    taint: false
Whenever the nodes are replaced, the following happens:
- First, 3 new nodes (one in each AZ) are introduced into the cluster.
- Then, if we don't have any PodDisruptionBudgets (PDBs) in the cluster, NTH drains the 3 old nodes (one in each AZ) in parallel.
Can we configure the node termination handler not to drain the nodes in all 3 AZs in parallel?
Or is there something else we can do to achieve this?
Describe alternatives you've considered
Using PDBs would not let all 3 nodes (one for each Availability Zone) drain in parallel.
If your concern is the availability of a particular set of pods across AZs, then PodDisruptionBudget is likely the right tool for the job. You could write one PDB per application per availability zone, which would ensure you have the right number of pods running in each AZ while regulating when the old nodes can be drained without endangering your uptime.
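For illustration, here is a minimal sketch of one such per-AZ PDB. The app name, zone value, and labels are hypothetical, and note the caveat: a PDB's selector matches pod labels, not node labels, so the pods would need to carry a zone label themselves (for example via per-zone Deployments or a mutating webhook); Kubernetes does not add one automatically.

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: myapp-pdb-us-east-1a    # hypothetical name; one such PDB per app per AZ
spec:
  minAvailable: 1               # keep at least one "myapp" pod running in this AZ
  selector:
    matchLabels:
      app: myapp                # hypothetical app label
      zone: us-east-1a          # hypothetical pod label; must be applied by you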
NTH will drain nodes as efficiently as possible, but it is always subject to any restrictions that the Kubernetes system places on the pods running on those nodes (like PDBs). NTH is too low down on the stack to know about those policies, but the "drain" commands it issues will respect the restrictions.
In general, draining nodes simultaneously across multiple availability zones is not inherently bad behavior; it could be undesirable depending on how the pods are distributed, but that is outside the scope of NTH to solve.
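If the underlying goal is keeping each application's pods spread across AZs while nodes churn, a pod topology spread constraint (a standard Kubernetes scheduling feature, not an NTH setting) is the usual complement to PDBs. A minimal sketch, again with a hypothetical app name and image:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
spec:
  replicas: 3
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp
    spec:
      topologySpreadConstraints:
        - maxSkew: 1                            # zones may differ by at most one pod
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: DoNotSchedule      # refuse to schedule rather than skew
          labelSelector:
            matchLabels:
              app: myapp
      containers:
        - name: myapp
          image: myapp:latest                   # hypothetical image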
@snay2 Thank you for the feedback, really appreciated!