deitch/aws-asg-roller

Running asg-roller at the same time as cluster-autoscaler results in a cluster of unschedulable nodes

tom-butler opened this issue · 1 comments

I've been trying to get ASG Roller to work with Cluster Autoscaler, but the two seem to clash and leave the cluster with only unschedulable nodes.

I think the following is happening:

  1. ASG Roller notices a difference in the launch template
  2. ASG Roller scales up the cluster
  3. Cluster Autoscaler notices the new nodes with no usage and taints them as PreferNoSchedule
  4. ASG Roller cordons and drains the old nodes (all nodes are now unschedulable)

The issue seems to be that cluster-autoscaler taints nodes before it scales them down, and the timing of that taint isn't configurable in cluster-autoscaler.

Could ASG roller be updated to set the annotation "cluster-autoscaler.kubernetes.io/scale-down-disabled": "true" during scaling events?

I believe this will stop ASG Roller and Cluster Autoscaler from clashing.
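To make the suggestion concrete, here is a rough sketch in Go of the patch body the roller could send to each node. It only builds a JSON merge patch that sets or clears the annotation. The function name `scaleDownDisabledPatch` is hypothetical and not part of ASG Roller. I'm assuming the result would be applied with something like client-go's node `Patch` call using `types.MergePatchType`; that wiring is left out here.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Annotation that cluster-autoscaler checks on a node before scaling it down.
const scaleDownDisabledKey = "cluster-autoscaler.kubernetes.io/scale-down-disabled"

// scaleDownDisabledPatch (hypothetical helper) returns a JSON merge patch for a
// Node object. With disabled=true it sets the annotation to "true"; with
// disabled=false it sets the key to null, which a merge patch treats as
// "remove this key".
func scaleDownDisabledPatch(disabled bool) ([]byte, error) {
	var value interface{} // nil marshals to JSON null
	if disabled {
		value = "true"
	}
	patch := map[string]interface{}{
		"metadata": map[string]interface{}{
			"annotations": map[string]interface{}{
				scaleDownDisabledKey: value,
			},
		},
	}
	return json.Marshal(patch)
}

func main() {
	// Print the patch for the start of a roll (true) and for the end (false).
	for _, disabled := range []bool{true, false} {
		p, err := scaleDownDisabledPatch(disabled)
		if err != nil {
			panic(err)
		}
		fmt.Printf("disabled=%v patch=%s\n", disabled, p)
	}
}
```

The idea would be to apply the `true` patch to the nodes before the roller scales up, and the `false` patch once the roll is finished so that Cluster Autoscaler can manage them again.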

Nice catch, @tom-butler. The irony is that I originally wrote this while working with a prod deployment that also used cluster autoscaler. I was worried about this conflict, but in the end that deployment didn't need the roller, while a different one, which doesn't use autoscaler, did, so the conflict just never came up.

Yes, I think that is the correct process, and then remove the taint when done scaling.

Care to open a PR for it?