aws/containers-roadmap

[EKS] [request]: Managed Nodes scale to 0

mikestef9 opened this issue · 218 comments

Currently, managed node groups have a required minimum of 1 node per node group. This request is to update the behavior to support node groups of size 0, to unlock batch and ML use cases.

https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/FAQ.md#how-can-i-scale-a-node-group-to-0

This feature would be great for me. I'm looking to run GitLab workers on my EKS cluster to run ML training workloads. Typically, these jobs only run for a couple of hours a day (on big instances), so being able to scale down would make things much more cost-effective for us.

Any ideas when this feature might land?

@mathewpower you might want to use a vanilla autoscaling group instead of EKS managed.

This issue pretty much makes EKS managed nodes a nonstarter for ML projects, since one node in each group is always on.

There are tasks now - perhaps that's the solution for this.

@jcampbell05 can you elaborate? What tasks are you referring to?

I guess that node taints will have to be managed like node labels already are in order for the necessary node template to be set: https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/cloudprovider/aws/README.md#scaling-a-node-group-to-0.

Hey @yann-soubeyrand, that is correct. Looking for some feedback on that: would you want all labels and taints to automatically propagate to the ASG in the required format for scale to 0, or to have selective control over which ones propagate?

@mikestef9 If AWS has enough information to propagate the labels/taints to the ASG, then I think it'd be preferable to have it "just work" as much as possible.

I think there will still be scenarios where manual intervention is needed by the consumer, such as setting region/AZ labels for single-AZ node groups so that cluster-autoscaler can make intelligent decisions when a specific AZ is needed; however, we should probably try to minimize that work as much as possible.

@mikestef9 in my understanding, all the labels and taints should be propagated to the ASG in the k8s.io/cluster-autoscaler/node-template/[label|taint]/<key> format, since the cluster autoscaler makes its decisions based on them. If some taints or labels are missing, this could mislead the cluster autoscaler. Also, I'm not aware of any good reason not to propagate certain labels or taints.

A feature which could be useful, though, is being able to disable cluster autoscaler for specific node groups (that is, not setting the k8s.io/cluster-autoscaler/enabled tag on these node groups).
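
For anyone wanting to try this by hand in the meantime, the tag format under discussion looks roughly like the sketch below (the ASG name and the example label/taint keys are made-up placeholders, not values from this thread):

# Illustrative only: "my-nodegroup-asg", the "workload" label, and the
# "dedicated" taint are placeholders showing the node-template tag format.
aws autoscaling create-or-update-tags --tags \
  "ResourceId=my-nodegroup-asg,ResourceType=auto-scaling-group,PropagateAtLaunch=true,Key=k8s.io/cluster-autoscaler/node-template/label/workload,Value=gpu" \
  "ResourceId=my-nodegroup-asg,ResourceType=auto-scaling-group,PropagateAtLaunch=true,Key=k8s.io/cluster-autoscaler/node-template/taint/dedicated,Value=gpu:NoSchedule"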

@dcherman isn't the AZ case already managed by cluster autoscaler without specifying label templates?

@yann-soubeyrand I think you're right! Just read through the cluster-autoscaler code, and it looks like it discovers what AZs the ASG creates nodes in from the ASG itself; I always thought it had discovered those from the nodes initially created by the ASG.

In that case, we can disregard my earlier comment.

I would like to be able to forcibly scale a managed node group to 0 via the CLI, by setting something like desired or maximum number of nodes to 0. Ignoring things like pod disruption budgets, etc.

I would like this in order for developers to have their own clusters which get scaled to 0 outside of working hours. I would like to use a simple cron to force clusters to size 0 at night, then give them 1 node in the morning and let cluster-autoscaler scale them back up.
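
As a rough sketch of what that cron-driven flow could look like via the AWS CLI (assuming the scaling API accepts zero, which it only does later in this thread; the cluster and node group names are placeholders):

# Evening: force the node group down to zero (placeholder names).
aws eks update-nodegroup-config \
  --cluster-name dev-cluster \
  --nodegroup-name dev-workers \
  --scaling-config minSize=0,maxSize=5,desiredSize=0

# Morning: give it one node back and let cluster-autoscaler take over.
aws eks update-nodegroup-config \
  --cluster-name dev-cluster \
  --nodegroup-name dev-workers \
  --scaling-config minSize=0,maxSize=5,desiredSize=1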

Hi All
Is this feature already available for AWS EKS?
From the following documentation it appears EKS supports it: "From CA 0.6 for GCE/GKE and CA 0.6.1 for AWS, it is possible to scale a node group to 0"
https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/FAQ.md#how-can-i-scale-a-node-group-to-0
Can someone please confirm?

@sibendu it's not supported with managed node groups yet (that is the subject of this issue), but you can achieve it with non-managed node groups (following the documentation you linked).

Would be great to have this; we use cluster autoscaling to bring up GPU nodes on demand on GKE and scale down when there are no requests. Having one node idle is definitely not cost-effective for us if we want to use managed nodes on EKS.

Putting use cases aside (although I have many), autoscaling groups already support a min, max, and desired size of 0. A node group is ultimately just an autoscaling group, and therefore already supports a size of 0. You can go into the AWS web console, find the ASG created for a node group, and set the size to 0, and it's fine, so it doesn't make sense that node groups don't support a zero size. As a loyal AWS customer it's frustrating to see things like this - there appears to be no good technical reason for preventing a size of zero, but forcing customers to have at least 1 instance makes AWS more £££. Hmmm... was the decision to prevent a zero size about making it better for the customer, or is Jeff a bit short of cash?

@antonosmond there are good technical reasons why you cannot scale from 0 with the current configuration: for the autoscaler to be able to scale from 0, one has to put tags on the ASG indicating the labels and taints the nodes will have. These tags are missing as of now. That is the purpose of this issue.

@yann-soubeyrand The cluster autoscaler is just one use case, but this issue shouldn't relate specifically to the cluster autoscaler. The issue is that you can't set a size of zero; regardless of use case, and whether or not you run the cluster autoscaler, you should be able to set a size of zero, as this is supported by autoscaling groups.

In addition to the use cases above, other use cases for 0 size include:

  • PoCs and testing (I may want 0 nodes so I can test my config without incurring instance charges)
  • having different node groups for different instance types where I don't necessarily need all instance types running at all times
  • cost saving e.g. scaling to zero overnight / at weekends

@antonosmond if you're not using cluster autoscaler, you're scaling the ASG manually, right? What prevents you from setting a min and desired count to 0? It seems to work as intended.

@yann-soubeyrand I got to this issue from here.
It's nothing to do with the cluster autoscaler, I simply want to create a node group with an initial size of 0.
I have some terraform to create a node group, but if I set the size to 0 it fails because the AWS API behind the resource creation validates that the size is greater than zero.
Update - and yes, I can create a node group with a size of 1 and then manually scale it to zero, but I shouldn't need to. The API should allow me to create a node group with a zero size.

The API should allow me to create a node group with a zero size.

I think we all agree with this ;-)

Hey guys,

is there any update on this one?

thanks!

Not having this makes it exceptionally hard to migrate from one node group to another (we are in the process of moving to our own launch templates) without fearing we'll break everything with no good rollback procedure.

I agree, this would be a great feature. Having to drain + change the ASG desiredInstanceCount is tedious. I have an infrequently accessed application running on EKS that I spin up when needed, but I don't need it to sit idle at 1 instance even when not being used. Any update on timeline?

Looking to see if there's any update here?

I believe this is preventing me from having multiple different instance types across multiple different node groups. If I want to have a node group for each size of m5s, now I have to have at least 1 running for each as well even if it's unlikely that I need the 2xl or 4xl.

Adding some noise here. Spot instances were one of our hurdles (thanks for delivering!), but we are holding off on moving to managed node groups until we can be assured we won't have large, idle nodes for the sporadic bigger workloads. Any updates here would be helpful.

Yep, this is a bummer and certainly makes migrating from ASGs to managed node groups much less appealing. +1 for this feature.

Update - and yes, I can create a node group with a size of 1 and then manually scale it to zero, but I shouldn't need to. The API should allow me to create a node group with a zero size.

When you say manually scale it to zero, do you mean literally change the desired value of the ASG after the fact? Is that permanent - at least until you re-deploy the infra? @antonosmond

@calebschoepp It seems to be so far, including after running upgrades on the nodegroups. I actually do this using a local-exec provisioner in Terraform after creation of the nodegroup for the ones that I want scale to 0 on.
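
For reference, the command such a local-exec provisioner (or a manual workaround) would run is just a plain ASG update; the group name below is a placeholder, and note this edits the ASG behind the managed node group directly, so the node group API may no longer reflect it:

# Placeholder ASG name; this bypasses the EKS nodegroup API entirely.
aws autoscaling update-auto-scaling-group \
  --auto-scaling-group-name eks-dev-workers-asg \
  --min-size 0 \
  --desired-capacity 0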

We make heavy use of GPU nodes in EKS with Jupyter notebooks that autoscale based on requests and are pruned after inactivity. It makes it impossible for us to migrate to managed nodes, as GPU instances are so expensive and we would need one always on. Hoping this gets released sooner rather than later 👍

I have infrequent bursts of heavy workloads requiring many resources. It doesn't make sense to keep a machine running at all times.
Please make this happen!

Almost a year since this issue has been brought up. Any update or timeline on when we might have it?

Thank you!

I want to downsize dev node groups when no work is actively being done. Please add this feature!

One question we have here as we are working on this feature - we see two options when you create a node group

  1. Allow desired size and min size to be 0.
  2. Min size can be 0, but desired size still has a minimum of 1 (and can be scaled to 0 desired size after initial creation).

We are leaning towards option 2, as we feel it's better for any node group misconfiguration issue that may cause a node not to join the cluster to be identified up front, but please let us know if you have a use case for desired size also being set to 0 as part of node group creation.

Option 2 is logical and helpful.

@mikestef9
I think it would be better to go with Option 1.

Option 2 might necessitate doing two updates via IaC tools, one to set the initial state and one to then immediately overwrite that state by setting the desired back to 0.

I think letting the users be always explicit about the minimum and desired size allows for more flexibility in configuration.

I'm open to hearing any other thoughts/use cases!

I'm fine with either of those decisions. For nodegroups where you want to scale to 0, it's highly likely that you're using cluster-autoscaler or another autoscaler to manage the desired size, so the 10-15min that a node would exist before being destroyed is not a dealbreaker imo if it makes identifying misconfigured/unhealthy nodegroups easier.

Echoing @HTMLGuyLLC and @acesir

The GPU use case is something I'm currently using and the dev workloads is something I'm planning.

In both of these cases, having a desired count of zero to allow AutoScaling to control the desired count would be ideal.

My personal preference would be option 1 as it doesn't force additional calls in order to downscale after cluster creation. Having said that, either option would work at this point for a much needed feature like this.

+1 for option 1, Terraform users will be much happier. It would be ok to make option 2 the default for dashboard interactions.

Either option would work, but also have a preference for option 1.

Honestly, I tried option 1 before reading and seeing that wasn't possible lol. I vote for option 1 as well.

Tried to scale to 0 and it errored about minSize so I tried to set that to zero as well (option didn't exist). Whoops.

I also lean toward Option 1, because of IAC concerns. But realistically, using Terraform and the EKS TF provider, we only set min and max sizes at creation time; we're not setting a static desired cluster size in our IAC. We just leave sizing up to cluster autoscaler and would only set desired size to 0 when we want to scale a deployed cluster down (e.g. in a nightly cron). So option 2 might also be fine?

matti commented

Option 2 would mean that you unnecessarily start an instance like c5a.24xlarge before it scales down.

GKE made the unfortunate decision to start minimum nodes with the first node pool. So did Scaleway.

Please don't do it again.

Thanks for working on this. I have a strong preference for Option 2, but would accept either option.

My opinion to justify Option 2:

  • As a cluster operator myself, it is much better to address misconfiguration issues early (fail fast). A person creating a nodegroup (of any size) will always be creating a nodegroup under the assumption that it will eventually have a size > 0. Better to test that this assumption is true, rather than be caught out later when something should scale up but doesn't due to misconfiguration.
  • In terms of IaC or Terraform picking up changes or requiring extra API calls, I don't think this is an issue. It is common to have this behaviour already with ECS services (Terraform aws_ecs_service) or ASGs (Terraform aws_autoscaling_group), where the Terraform lifecycle is used to ignore changes to desired counts. Essentially as follows:
  lifecycle {
    ignore_changes = [desired_count]
  }
  • In terms of extra API calls to scale down after creation, I think almost anyone using this feature will be using the Kubernetes cluster-autoscaler. AWS is working to make this very common add-on one of the addons available with EKS, so I believe it will have good support for this out of the box.
  • For any use-case where a nodegroup size of 0 is desired, it would likely be an environment where the operator is managing ad hoc / ephemeral / expensive workloads that are not required to have capacity or be responsive 24/7 (e.g. notebook environments, batch jobs, GPU jobs, ML training, developer environments, etc). In these scenarios, I would set up monitoring to ensure that the nodegroup is indeed scaling down to zero at some stage in the day, or a similar metric to ensure that the nodegroup is not sitting idle for any length of time.

Saying all of the above, perhaps there is an Option 3?

What if users were allowed to specify all 3 values (min, max, desired)? Where: min <= desired <= max.

A user wishing to ensure that no instances are created at all until requested (through cluster-autoscaler or other) could set the values:

min = 0
desired = 0
max = N
lifecycle {
  ignore_changes = [desired]
}

A user wishing to ensure that the nodegroup can join a cluster and can correctly scale up beyond 0 and then scale back down (through cluster-autoscaler or other) could set the values:

min = 0
desired = 1
max = N
lifecycle {
  ignore_changes = [desired]
}

So, if the question is around whether or not the default value for desired should be 0 or 1, I think the default value should be 1 (to avoid the misconfiguration issues mentioned above and fail-fast).

However, this still provides the option for operators to set the desired count to 0 at creation time for those who are certain they do not have a misconfiguration and who are absolutely certain they do not want a node to boot until requested. To me, that gives defaults that are operator "safe" but still gives power-users the option to opt-out of the safety if they desire.

Of course, I can understand how people might perceive this as "costing money" and would prefer the default to be 0.

I can accept that as long as it's possible to specify all 3 values, then I can opt-in to a "safe" desired count of 1 when a new nodegroup is created and the cluster-autoscaler can scale it down for me.

Our scenario is to create EKS clusters for other teams using Terraform, allowing them to choose instance types available to their workloads and managed by cluster autoscalers. There can be a significant number of groups.

A desired count of 0 is a valid scenario, and an initial count of 0 is as well. It strikes me as odd that I'd have to work around something for my own protection, using a totally different mechanism, for something that's not dangerous, only possibly perplexing. I'm not sure what that mechanism even is... iterating through values making curl calls against the API once the Terraform is done, and then making sure the values are ignored?

The code should do what I ask it and the interface should communicate the rules. Option 1.

I prefer option 1 since my team creates ASGs for every possible instance type that might be required by a workload in the cluster, and we let the cluster-autoscaler scale up the more expensive types only when they are needed. We use Terraform from automation for provisioning, and our pipeline expects to reach our desired state with a single terraform apply operation.

If there is concern about users shooting themselves in the foot, you might add an additional flag like --force-desired-zero to require them to acknowledge what they're doing.

Option 2 might be alright if the terraform-provider-aws decides to add support for option 1, though it is unusual for a Terraform provider to have a different API from the upstream API.

TBBle commented

I prefer Option 1. In our rollout, we have many sets of identical node groups spread across AZs, e.g., due to EBS-CSI and Cluster Autoscaler interactions, and even if we are going to run an instance to validate, we would be validating one of the identical node groups in each set, not all of them.

And we'd validate it by throwing load at it, since we're validating the scale-from-zero case, ensuring that we have the AWS tags for correct Cluster Autoscaler operation, and that won't be tested if we started with a node already running.

+1 for option 1. It's more flexible.

What a strange logic for number two.

Would you apply logic like this to normal autoscaling groups? It doesn't make much sense here either.

pre commented

If an operator wishes to go with Option 2, they can do it if Option 1 was the behaviour.

If Option 1 is not the behaviour, users are always locked into Option 2.

Option 1 allows more flexibility, as it should be up to the cluster operator to decide which way to go. With large or specialized instance types, the concern about unnecessary extra costs is a real issue which costs real money.

matti commented

Money is not an issue in this economy. Let's go with the option 2.

pre commented

Money is not an issue in this economy. Let's go with the option 2.

One could use those unnecessary instances to mine bitcoin... 🤔

+1 for option #1 (for many of the reasons above).

Option 1 makes more sense to me, but with cluster-autoscaler, I guess option 2 just means a few minutes with a node you don't need? Then it would go to 0 and then be fine.

Go with option 1 and leave it up to users to verify that nodes can join the cluster. If you have a bunch of node groups and never bother to verify that the nodes can join, that seems like it's operator error.

I guess option 2 just means a few minutes with a node you don't need? Then it would go to 0 and then be fine.

Yes, that's my understanding of the situation. There are good points made above that if the cluster-autoscaler is managing multiple nodegroups across AZs, which is a recommended approach, then having to scale up a node in each AZ (which will then subsequently be scaled down) isn't ideal. I think the best approach should be Option 1, which I've convinced myself is the same as Option 3 here 😂

matti commented

We run clusters where we have specified over 30 instance types available - going with option 2 would mean that managed node group creation creates 30 instances.

I'm guessing you aren't running with launch templates if you have a nodegroup per instance type? With launch templates you can use MixedInstancePolicy in cluster-autoscaler so it can run across mixed instances (assuming they are the same size, so you can ratio across reserved instances, spot instances, etc).

Our nodegroups are organized more around their purpose than the underlying instance type, so we avoid workloads knowing much about the hardware apart from generic things such as GPU, CPU, and memory requirements.

Option 1 preferred

Appreciate all the great feedback here! I think it's pretty clear option 1 is a valid use case for many of you, and we are now leaning that way.

I did want to add a little more detail on why we were considering option 2. If you were to create a node group with desired size 0, and later that night cluster autoscaler tries to scale the node group, we have no way of proactively notifying you about issues with nodes joining the cluster. You will only be able to see details by looking in the console or calling the describe node group API. (In the future, we do plan to integrate with a service like EventBridge to provide this kind of functionality.)

So while we will likely allow option 1, we will strongly recommend option 2 (setting desired=1 on node group creation) to check everything is working correctly when you initially deploy your node group.

pre commented

The decision on whether to verify instance launch on MNG creation should be left to the cluster operator.

Option 1 makes option 2 possible, but option 2 does not enable option 1.

Related issue: kubernetes/autoscaler#3856

Isn't it possible to set the desired amount at initial creation only?
The downside of setting desired in the config is that whenever the config is updated, the desired amount is reset to that value again; that's why we don't specify desired in our CloudFormation config.
Otherwise, when the autoscaler has adjusted the number of nodes to, say, 10, a CF update that sets it back to 1 kills a lot of containers...

There is a way to do this in Terraform, where you can ask it to ignore lifecycle changes to the desired field. Not sure about CloudFormation.

I believe this is possible currently. As mentioned above, you can just go into the Auto Scaling groups console, click on the EKS scaling group you want to edit, and click Edit under Group details. Then you can change it to 0.

For some reason, the EKS console limits the min value to 1, but under Auto Scaling groups you can change it down to 0.

Another question.
I created a managed node pool which spans 3 AZs.
When I set the MinSize to 1, I still get MinSize 3 in the EKS cluster overview.
For this use case it does not matter which AZ the instances are in, as long as it is HA.
Is it the opinionated way of node pools to set the MinSize to 1 per AZ?

I can edit it back to min 1 in the console, but it does not happen with CF.

Btw:
desiredSize is required (Service: AmazonEKS; Status Code: 400; Error Code: InvalidParameterException; Request ID: 4f2d97c5-6589-4ba0-a94e-6fe39a1da80d; Proxy: null)
This is new...? Is anyone else seeing this?

mmbcn commented

+1 for option no. 1.

We are facing quite big issues because of this. We use Flux CD, and when scaling down with EKS we can only scale to 1 (not 0). We do this every night, and every morning we scale up. So while the cluster is resting overnight, 1 node is still active and Flux tries to deploy non-stop to that one node, which eventually leaves the node totally broken the next morning. That requires manual intervention: deleting the old node so CF will recreate it. It is super annoying, as we have everything automated, and now that we changed to managed node groups we need to intervene manually every morning. We switched scaling to ASGs to avoid manual intervention, because there you can set 0 as desired/max/min nodes.

Would be great to have this ability with eksctl!

Thx!

Many thanks for this (we also prefer option 1) - do you have any idea of timelines for enabling this?

As mentioned in #724 (comment), this can be configured, and I managed to scale my nodegroup up/down from/to 0 with cluster-autoscaler (though the scale-up part is not stable).

TBBle commented

The problem with that fix (directly editing the ASG) is that now the ASG doesn't match what the Managed Node Groups feature created, and it will refuse to update it if you make further changes to the node groups through the API.

Or at least, that's my understanding from previous discussion of the same approach to solving other missing features for Managed Node Groups. I thought that was somewhere else on this issue tracker, but can't find an actual reference right now. It might have been in the eksctl issue tracker, I guess.

@mikestef9 My team would prefer "Option 1", namely simply allowing users to set min_size and desired_size to 0, even if the respective tags, labels, and resources for the cluster-autoscaler still need to be set manually to allow it to scale from 0. Even if this isn't ideal for a "managed node group", we would much prefer to only spin up expensive nodes when absolutely necessary.

I think option 1 makes the most sense as well, at least as far as IaC is concerned.

+1 for option 1 as this is useful for IAC and real-world use case. Any update here?

+1 for option 1.

We have settled on option 1, so you will be able to specify zero as a minimum when creating a node group. For an update on implementation, we have decided to do the work in Cluster Autoscaler itself to pull labels, taints, and extended resources directly from the managed node groups API, rather than tagging the underlying ASG. You can follow the progress here

kubernetes/autoscaler#3968

Hi, do we have any update on that?
kubernetes/autoscaler#3968 says merged.
Do we know when this will be available on EKS? Running 1.19 right now and I still have the issue.

We'll go with option 1 please.

+1 for option 1

TBBle commented

@dgoupil To be clear, kubernetes/autoscaler#3968 was just the agreement to implement a certain approach in Cluster Autoscaler to support this. The work itself still needs to be done (and is presumably in progress), and you will then need to upgrade Cluster Autoscaler once the feature is implemented so that it can scale up a zero-size managed node group.

So it won't appear suddenly on your cluster for you, and is definitely not guaranteed to work with a 1.19 cluster, as that depends on cluster autoscaler compatibility and/or backporting.

Anyone have a workaround for this? I tried launching the GPU instances in worker_groups, but strangely it's creating them OUTSIDE the cluster. I'm using this module for Kubeflow and have no problems when I set min_size = 1. The problem is I'm trying to set up autoscalers to keep the GPU instance count at zero unless a GPU-intensive Kubeflow job is running (rarely), and have the Kubeflow server itself set up on an m5a.2xlarge. Seems I'm waiting with everyone else for min_size = 0 support.

+1, same request here. https://docs.aws.amazon.com/eks/latest/userguide/cluster-autoscaler.html says it's possible to scale from zero, but I cannot find any practical guide on how to do so.

TBBle commented

As noted at #724 (comment) you can implement this by directly editing the ASG, but that may interfere with other behaviours in Managed Node Groups.

I'd say that right now, if you want scale-to-zero, unmanaged node groups are a better, easier, and reasonably well supported option.

Well a possible temporary workaround I found is indeed using it with unmanaged node groups. I use the eks tf module: https://registry.terraform.io/modules/terraform-aws-modules/eks/aws/latest

This allows an ASG with 0 min and 0 desired, and it is in a working state for what I use it for. If it scales up, or if I manually set it to 1 after tf apply, the provisioned worker node correctly joins the cluster in a working state.

Example of code here:

  worker_groups = [
    {
      name                          = "worker-group-1"
      instance_type                 = "p3.2xlarge"
      ami_id                        = "ami-05685e4301fc45b62"
      spot_price                     = "4.00"
      asg_desired_capacity          = 0
      asg_min_size                  = 0
      asg_max_size                  = 5
      root_volume_size              = 100
      additional_security_group_ids = [aws_security_group.worker_group_1.id]
      root_volume_type              = "gp2"
      kubelet_extra_args  = "--node-labels=gpu-count=1"
      # kubelet_extra_args  = "--node-labels=gpu-count=1 --node-labels=lifecycle=Ec2Spot --node-label=aws.amazon.com/spot=true --register-with-taints=spotInstance=true:PreferNoSchedule"
      tags = [
        {
          "key"                 = "k8s.io/cluster-autoscaler/node-template/label/lifecycle"
          "value"               = "Ec2Spot"
          "propagate_at_launch" = "true"
        },
        {
          "key"                 = "k8s.io/cluster-autoscaler/node-template/label/aws.amazon.com/spot"
          "value"               = "true"
          "propagate_at_launch" = "true"
        },
        {
          "key"                 = "k8s.io/cluster-autoscaler/node-template/label/gpu-count"
          "value"               = 1
          "propagate_at_launch" = "true"
        },
        {
          "key"                 = "k8s.io/cluster-autoscaler/enabled"
          "value"               = "true"
          "propagate_at_launch" = "true"
        },
        {
          "key"                 = "k8s.io/cluster-autoscaler/node-template/taint/spotInstance"
          "value"               = "true:PreferNoSchedule"
          "propagate_at_launch" = "true"
        },
        {
          "key"                 = "k8s.io/cluster-autoscaler/kubeflow"
          "value"               = "owned"
          "propagate_at_launch" = "true"
        }
      ]
    },
    
  ]

  node_groups = [
    {
      name               = "ng-1"
      capacity_type      = "SPOT"
      instance_types     = ["m4.large", "t3.large"]
      spot_price         = "4.00"
      min_capacity       = 1
      desired_capacity   = 3
      max_capacity       = 10
      disk_size          = 100
      disk_type          = "gp2"
      k8s_labels = {
        "node-class" = "worker-node"
      }
      kubelet_extra_args = "--node-class=worker-node --node-labels=lifecycle=Ec2Spot"
      tags = [
        {
          "key"   = "k8s.io/cluster-autoscaler/node-template/label/lifecycle"
          "value" = "Ec2Spot"
        },
        {
          "key"   = "k8s.io/cluster-autoscaler/node-template/label/aws.amazon.com/spot"
          "value" = false
        },
        {
          "key"   = "k8s.io/cluster-autoscaler/node-template/label/gpu-count"
          "value" = 0
        },
        {
          "key"   = "k8s.io/cluster-autoscaler/enabled"
          "value" = true
        },
        {
          "key"   = "k8s.io/cluster-autoscaler/kubeflow"
          "value" = "owned"
        }
      ]
    },
  ]

Another vote for Option 1. The work has been done in the cluster-autoscaler project, but AWS still needs to allow 0 for desired/min size. Please can you action this asap?
Then terraform has a ticket to make the change on their side terraform-aws-modules/terraform-aws-eks#1233
Thank you!

Can we get a status update on this issue? It's been 16.5 months since the issue was opened, and 4.5 months since the work item was picked up and the community was asked about the approach. There is a lot of community interest in this capability and it's a major blocker to adoption of Managed Nodes.

Just looking for a ballpark here, like is it coming in Summer 2021? 2021 at all?

Hey all,

Some good progress to share here. The managed node groups API now accepts a value of zero for both minimum and desired size. This unblocks users whose applications rely on basic node attribute discovery in cluster autoscaler, including CPU, memory, and GPU. Additionally, you can manually scale up and down to zero for jobs that only run during business hours.

The next phase is to implement the cluster autoscaler changes as proposed and accepted here. This will allow you to scale up from zero based on pods that have node selectors, taints, or requests for extended node resources. We'll leave this issue open as we continue the upstream work on cluster autoscaler.
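
As an illustration of the new behaviour (all names below are placeholders, and required inputs such as subnets and the node role are abbreviated), creating a node group that starts at zero could look something like this:

# Sketch only: a node group created with min and desired size of zero.
aws eks create-nodegroup \
  --cluster-name my-cluster \
  --nodegroup-name batch-gpu \
  --scaling-config minSize=0,maxSize=10,desiredSize=0 \
  --subnets subnet-aaaa subnet-bbbb \
  --node-role arn:aws:iam::111122223333:role/my-node-role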

Am I correct to assume that, even before the proposed autoscaler change, we can now scale up from zero (based on node selectors and taints) if we tag our managed node groups with k8s.io/cluster-autoscaler/node-template/label and k8s.io/cluster-autoscaler/node-template/taint ?

sylr commented

Am I correct to assume that, even before the proposed autoscaler change, we can now scale up from zero (based on node selectors and taints) if I tag my managed node groups with k8s.io/cluster-autoscaler/node-template/label and k8s.io/cluster-autoscaler/node-template/taint ?

Yes, absolutely, I do it already. But you'll have to tag the ASG, not the managed node groups.

sylr commented

Actually I currently tag the EKS node groups with the cluster-autoscaler tags and then run a script to transfer those tags to the associated ASGs:

export AWS_PROFILE=xxxxxxx
K8S_CLUSTER=eks-euw1-01
DRY_RUN=echo

for nodegroup in $(aws eks list-nodegroups --cluster-name ${K8S_CLUSTER} | jq -r '.nodegroups[]'); do
  echo $nodegroup

  nodegroup_desc=$(aws eks describe-nodegroup --cluster-name ${K8S_CLUSTER} --nodegroup-name ${nodegroup})
  nodegroup_asg=$(jq -r '.nodegroup.resources.autoScalingGroups[0].name' <<<"${nodegroup_desc}")
  nodegroup_tags=$(jq -r '[ .nodegroup.tags | to_entries[] | (.key + "=" + .value) ] | .[] | select(. | contains("k8s.io/cluster-autoscaler/node-template"))' <<<"${nodegroup_desc}")

  for tag in ${nodegroup_tags}; do
    key=$(cut -d "=" -f1 <<<"${tag}")
    value=$(cut -d "=" -f2 <<<"${tag}")
    ${DRY_RUN} aws autoscaling create-or-update-tags --tags "ResourceId=${nodegroup_asg},ResourceType=auto-scaling-group,PropagateAtLaunch=true,Key=${key},Value=${value}"
  done
done

Set DRY_RUN to empty to actually apply the tag creation.

Thanks @sylr, very informative. Guess it would have been much easier if node groups just propagated tags to ASGs. The workaround might not be as straightforward from terraform.

I wonder why the autoscaler proposal description mentions "Many tags are already added to the ASG by standard components like the AWS cloudprovider for Kubernetes and by customers for billing and cost association purposes", yet we are still unable to tag the ASG for billing/other purposes in an automated way (this issue remains open, #608 (comment)). I understand that a fully automated approach may need a solution for the 50-tag limit, but why not allow us to propagate those custom tags in the meantime?

So v3.46.0 was released with the ability to specify 0 for managed node groups; is there anything left pending?

@stevehipwell

The next phase is to implement the cluster autoscaler changes as proposed and accepted here. This will allow you to scale up from zero based on pods that have node selectors, taints, or requests for extended node resources. We'll leave this issue open as we continue the upstream work on cluster autoscaler.

@mikestef9 is there a reason why the managed node group can't create the relevant cluster-autoscaler ASG tags from the managed labels and taints? The current behaviour makes managed node groups unusable in all but the simplest scenarios, and the implementation of this should be trivial, even if you add a flag to enable it.

Rather than copying over labels and taints as tags to the ASG, we decided to instead make upstream cluster autoscaler changes to pull directly from the managed node group. That is the reason this ticket is still open, as we are actively implementing those changes.

As a workaround for now, the name of the underlying ASG is returned as part of the DescribeNodegroup API, so you could retrieve the ASG name and manually add the necessary cluster autoscaler tags.

jbg commented

For those using Terraform, I submitted a PR for a new Terraform resource which can tag an existing ASG; you can get the ASG name as an attribute of the node group and then use this resource to add the needed tags to it.

@mikestef9 that sounds pretty complex and AWS specific as well as overkill for the 80% use case; what were the reasons for not simply creating tags based on the managed node group labels and taints args? Please note I'm specifically not talking about copying tags here, that's another discussion for another use case.

RE the manual action, that's not really a viable option in an IaC system such as terraform. @jbg how does your new resource deal with the ASGs being deleted when the node group is updated?

jbg commented

@stevehipwell the resource just takes an ASG name as input. You can get a list of the ASG names related to a node group as an attribute on that node group. If the list changes, the tag resources will be recreated as needed, same way any dependencies work in terraform. It works fine.

@jbg I was just checking that when the list of ASGs changed that the resource wouldn't try and remove the tag from the now deleted ASG. This looks like a solution to non cluster-autoscaler tagging requirements and a stop gap for cluster-autoscaler ones.

There's an inherent race condition when you consider scaling activities concurrent with ASG replacement. With all of the async options with IaC (CloudFormation + Lambda or Terraform + null_resource/lambda/new module), the MNG has to be fully updated first before the trigger to add tags to the underlying ASGs can fire.

Cluster autoscaler may do odd things depending on how long that operation takes. The ability to update > 1 node concurrently helps reduce the period in which this may occur. If CA just understands MNG directly, it's more likely to behave as expected earlier on in the process.

jbg commented

It actually would, but the failure to remove in that case is handled as a success (in common with the way other similar TF resources that deal with applying things to "implicitly created" resources, like aws_ec2_tag, handle things being removed out from under them).

@justin-watkinson-sp I get the principle; it's just that the time being taken to come up with the gold-plated solution is leaving us with far more race condition issues due to there being no solution at all. Implementing the MVP solution in good time would allow better analysis for the gold-plated version while letting most people get some or all of the benefits.

Adding to the points I made above I've got a couple of other observations that I think warrant discussing further.

  • How does completely changing how the cluster-autoscaler works for EKS managed node groups keep behaviour consistent between EKS managed and unmanaged node groups, let alone other Kubernetes distros? One of the driving forces behind Kubernetes is the assumption (not always correct) of compatibility between vendors and distros; this seems to be widening the gap without the possibility for anyone else to bridge it. The other proposed solution sounds like a much better option, and the reason given against it doesn't stack up: instead of the managed node group writing the information about itself to the cluster in a generic way that cluster-autoscaler understands (no knowledge of cluster-autoscaler needs to leak into the managed node group), the managed node group implementation is leaking into cluster-autoscaler, which needs a custom driver just for managed node groups. Why not create a generic node group CRD that cluster-autoscaler can understand? Initially it could replace the tag pattern, and in the future it could be used to improve the logic, such as similar groups and upgrade lifecycles.
  • What is the realistic delivery time of the new pattern, including the work, testing, and EKS compatibility? Even if we ignore the development time, we're still looking at something that's unlikely (I'd love to be proven wrong) to be backported in a project that's at least one Kubernetes version ahead of the latest EKS version; how long will it take for EKS to catch up to the K8s versions supported by this new pattern?

TBBle commented

The MachinePool feature in Cluster API already is a "node group CRD" like you're talking about, but there's a fair bit of work ahead before it becomes useful. Particularly, Cluster Autoscaler doesn't yet support scale-to-zero for Cluster API-based clusters, and I don't think Cluster Autoscaler's Cluster API support has integrated Machine Pools yet anyway.

It's a long way from practically useful.