deitch/aws-asg-roller

ROLLER_ORIGINAL_DESIRED_ON_TAG is not compatible with cluster-autoscaler

TwiN opened this issue · 7 comments

TwiN commented

aws-asg-roller fights with cluster-autoscaler when the latter tries to scale up and the former tries to return to the original desired count.

This should not have been enabled by default, and should've been configurable through an environment variable.

After looking at how this was implemented, it doesn't seem like it's going to be a simple task at all to make this configurable.

TwiN commented

Honestly, I'm a bit stumped. I can't think of any way to make this configurable cleanly other than reverting the entire change, as it is now used in every core function.

On one hand, it provides a way to survive application restarts; on the other hand, if cluster-autoscaler adds new nodes while aws-asg-roller is trying to return to the original desired count, it will cause availability issues, which is not a risk I can take.

@deitch @outofcoffee do you have any suggestion?

TwiN commented

Yeah that's not going to work.

The problem is that aws-asg-roller can't really tell which nodes were scaled up as a result of its own change to the ASG's desired count and which resulted from cluster-autoscaler's changes -- nodes are not scaled up instantly, it takes some time.

As a result, it's not possible to update the "real" original desired count to account for the nodes cluster-autoscaler added, because they can't be differentiated from the ones spun up because of aws-asg-roller.

Technically, we could listen to the events created by cluster-autoscaler, but the problem is that cluster-autoscaler actually takes the nodes spun up by aws-asg-roller into consideration, meaning it may send an event that was in fact caused by aws-asg-roller.

What was nice about the previous way of calculating the "original" desired count was that it was recalculated on each run, thus taking cluster-autoscaler's changes into consideration.
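
Roughly, that approach amounts to something like the sketch below. The helper and its inputs are hypothetical (this is not the project's actual code), but it shows why recalculating every run absorbs CA's scale-ups automatically, assuming the aws-sdk-go v1 autoscaling types:

```go
package roller

import "github.com/aws/aws-sdk-go/service/autoscaling"

// targetDesired sketches the pre-#37 idea: derive the count to return to from
// the ASG's *current* state on every run, instead of persisting an "original"
// value on a tag. Because the current desired capacity already includes any
// scale-ups cluster-autoscaler made since the last run, CA's changes are
// absorbed automatically.
func targetDesired(asg *autoscaling.Group, oldInstanceIDs []string) int64 {
	current := *asg.DesiredCapacity
	// The only surplus the roller itself created is the capacity it added to
	// cover old instances that are still awaiting replacement.
	surplus := int64(len(oldInstanceIDs))
	return current - surplus
}
```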

deitch commented

Exactly. Two distinct systems fighting each other. This works fine when each works in isolation - not every ASG being rolled belongs to a k8s cluster, so not everyone has a separate autoscaler working in parallel (or, more correctly, against it) - but when you put the two together, it is a problem.

They each need to find a way to do one of:

  • assume that they are "not alone in the universe" and coordinate (back-end database, messages, something)
  • have their calculations take into account the other

The problem is that each just operates on its own.

Do you recall how cluster-autoscaler does its own calculations, if it stores the "original" and how?

TwiN commented

@deitch As far as I know, cluster-autoscaler publishes its current status in a ConfigMap named cluster-autoscaler-status; however, I don't think it reads from that ConfigMap - it only writes to it.
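
For reference, reading that status from another process is simple with client-go. A minimal sketch, assuming the default kube-system namespace, the default ConfigMap name, and a status data key (all of which can differ per deployment):

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	// Assumes we run in-cluster with RBAC permission to read ConfigMaps in
	// kube-system, where cluster-autoscaler writes its status by default.
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}
	cm, err := client.CoreV1().ConfigMaps("kube-system").
		Get(context.TODO(), "cluster-autoscaler-status", metav1.GetOptions{})
	if err != nil {
		panic(err)
	}
	// The status is a human-readable dump of CA's view of each node group;
	// it is informational only - CA writes it but does not read it back.
	fmt.Println(cm.Data["status"])
}
```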

To summarize, CA does this every scan-interval (default: 10 seconds):

  • Checks all events in the cluster
  • If there are any events about pods that cannot be scheduled due to lack of resources, it looks for an ASG that matches what the pod needs (node selector, taints, etc.) and scales that ASG up
  • Checks whether any pod could be rescheduled onto a different node while keeping the average utilization of all nodes matching that pod's requirements below scale-down-utilization-threshold (default: 0.5, meaning 50% of resources requested)
  • Checks whether any node has been unneeded for scale-down-unneeded-time (default: 10 minutes)
  • If a node has been unneeded for scale-down-unneeded-time, CA begins the scale-down process (the node becomes unschedulable and its pods get rescheduled onto different nodes)

The process above varies based on the parameters passed, but that's the gist of it.
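
To make the scale-down half concrete, here is a rough Go paraphrase of the decision described in the list above - not cluster-autoscaler's actual code, just its documented thresholds restated with made-up types:

```go
package sketch

import "time"

// Defaults mentioned above; both are configurable on cluster-autoscaler.
const (
	scaleDownUtilizationThreshold = 0.5
	scaleDownUnneededTime         = 10 * time.Minute
)

// nodeInfo is a minimal stand-in for what CA tracks per node.
type nodeInfo struct {
	utilization   float64   // requested / allocatable for the dominant resource
	unneededSince time.Time // zero value if the node is currently needed
}

// shouldScaleDown restates the scale-down decision: a node is only removed
// after its utilization has stayed below the threshold, and its pods have been
// reschedulable elsewhere, for the full unneeded window.
func shouldScaleDown(n nodeInfo, now time.Time) bool {
	if n.unneededSince.IsZero() {
		return false
	}
	return n.utilization < scaleDownUtilizationThreshold &&
		now.Sub(n.unneededSince) >= scaleDownUnneededTime
}
```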

If CA restarts, the entire process starts over. That's acceptable because the worst-case scenario is having too many nodes (scaling up is instant, while scaling down requires the node to be unneeded for scale-down-unneeded-time), and that's completely fine: too many nodes just means extra cost, while too few nodes means availability issues, so scaling down is not as critical as scaling up.

As of #37, this project is no longer compatible with cluster-autoscaler, because it no longer only scales down old nodes: it can also scale down new nodes in order to return to its original desired count.

aws-asg-roller's primary directive should be to replace old nodes with new nodes only, not to decide the final number of nodes.
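
One way to express that directive is to only ever make relative changes - terminate a replaced instance and let the ASG decrement by one - rather than writing back an absolute desired count. A rough sketch with the aws-sdk-go v1 API (illustrative only, not aws-asg-roller's actual implementation):

```go
package roller

import (
	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/service/autoscaling"
	"github.com/aws/aws-sdk-go/service/autoscaling/autoscalingiface"
)

// retireOldInstance terminates an old instance once its replacement is ready,
// decrementing the ASG's desired capacity by one as part of the termination.
// Because the change is relative, any capacity cluster-autoscaler added in the
// meantime is left untouched.
func retireOldInstance(svc autoscalingiface.AutoScalingAPI, instanceID string) error {
	_, err := svc.TerminateInstanceInAutoScalingGroup(&autoscaling.TerminateInstanceInAutoScalingGroupInput{
		InstanceId:                     aws.String(instanceID),
		ShouldDecrementDesiredCapacity: aws.Bool(true),
	})
	return err
}
```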

outofcoffee commented

Hi folks, the original version of the #37 PR had the new behaviour hidden behind a configuration option and retained the original behaviour to preserve existing semantics.

See the references to ROLLER_ORIGINAL_DESIRED_ON_TAG in the PR description for details.

If it would help in the interim whilst the approach is figured out, we could do the same again.

TwiN commented

@outofcoffee Ah, I had seen c1e8e98 but assumed that the configurable portion of it had been removed due to the added complexity (specifically for testing).

TwiN commented

Bit of an update on this: reintroducing ROLLER_ORIGINAL_DESIRED_ON_TAG just ended up surfacing new problems when I leveraged it on my clusters, and I ended up writing a new ASG roller that handles almost everything differently. So while this is no longer a problem for me, I encourage others to keep an eye out for the aforementioned problems.