alphagov/gsp

NodeGroups ASGs should be per-AZ not balanced over AZs


What

We should have a separate Auto Scaling Group (ASG) per Availability Zone (AZ) for each NodeGroup.
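
A minimal Python sketch of the proposed layout. It assumes three AZs in eu-west-2 and a "<node group>-<az>" naming scheme. Both are hypothetical and not taken from this repo. Each NodeGroup expands into one ASG per AZ, each with its own size bounds, so no single ASG ever spans AZs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AsgSpec:
    """One Auto Scaling Group, pinned to exactly one Availability Zone."""
    name: str
    node_group: str
    availability_zone: str
    min_size: int  # bounds apply to this AZ only, not to the whole NodeGroup
    max_size: int


def per_az_asgs(node_group: str, azs: list[str], min_size: int, max_size: int) -> list[AsgSpec]:
    """Expand one NodeGroup into one ASG per AZ (hypothetical naming scheme)."""
    if min_size > max_size:
        raise ValueError("min_size must not exceed max_size")
    return [
        AsgSpec(
            name=f"{node_group}-{az}",
            node_group=node_group,
            availability_zone=az,
            min_size=min_size,
            max_size=max_size,
        )
        for az in azs
    ]


if __name__ == "__main__":
    # Hypothetical example: a "worker" NodeGroup across three London AZs.
    for spec in per_az_asgs("worker", ["eu-west-2a", "eu-west-2b", "eu-west-2c"], 1, 5):
        print(spec.name, spec.availability_zone)
```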

Why

  • Auto Scaling Groups attempt to "rebalance" instances across AZs. This is rarely what we want in our setup, and it can cause problems for Pods that need EBS volumes, because EBS volumes are bound to a single AZ.
  • Cluster autoscaler does not support Auto Scaling Groups which span multiple Availability Zones; instead you should use an Auto Scaling Group for each Availability Zone and enable the --balance-similar-node-groups feature. If you do use a single Auto Scaling Group that spans multiple Availability Zones you will find that AWS unexpectedly terminates nodes without them being drained because of the rebalancing feature.
  • It gives us finer-grained control over the nodes in each AZ, rather than scaling up by 3 nodes each time.
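
The cluster-autoscaler side of this can be sketched in Python. This is an illustration, not the project's actual configuration. It builds the autoscaler's command-line flags for a set of per-AZ ASGs, using the static --nodes=<min>:<max>:<asg name> form and enabling --balance-similar-node-groups as the quoted docs recommend. The ASG names and sizes are hypothetical. Auto-discovery of ASGs by tag is an alternative to listing each --nodes entry explicitly.

```python
def autoscaler_args(asgs: list[tuple[str, int, int]]) -> list[str]:
    """Build cluster-autoscaler flags for per-AZ ASGs given as (name, min, max)."""
    args = [
        "--cloud-provider=aws",
        # Keeps the per-AZ groups of one NodeGroup at similar sizes.
        "--balance-similar-node-groups",
    ]
    for name, min_size, max_size in asgs:
        # One --nodes entry per single-AZ ASG.
        args.append(f"--nodes={min_size}:{max_size}:{name}")
    return args


if __name__ == "__main__":
    # Hypothetical ASG names matching a "<node group>-<az>" scheme.
    per_az = [
        ("worker-eu-west-2a", 1, 5),
        ("worker-eu-west-2b", 1, 5),
        ("worker-eu-west-2c", 1, 5),
    ]
    print(" ".join(autoscaler_args(per_az)))
```

Because each AZ has its own ASG, the autoscaler can add a single node in the AZ where a Pod's EBS volume lives, instead of growing every AZ together.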

More info here: https://github.com/kubernetes/autoscaler/tree/master/cluster-autoscaler/cloudprovider/aws

Resolved by #600 and friends. We have left the kiam/ci nodes as they are, spanning all AZs, since they are not autoscaled.