aws/amazon-vpc-cni-k8s

IP assignment doesn't respect maxPods configuration

qibint opened this issue · 8 comments

Description

Observed Behavior:
I've configured the pod density per node as follows for smaller AWS instances (g5.xlarge, g5.2xlarge, g5.4xlarge):
podsPerCore = 4
maxPods = 20
However, the number of IPs assigned to these nodes (around 80-100 IPs per interface) far exceeds the maximum number of pods that can run on the node.

Expected Behavior:
The IP assignment (through the VPC CNI) should take the maxPods value into account.
Since WARM_ENI_TARGET, WARM_IP_TARGET, and MINIMUM_IP_TARGET are VPC CNI addon settings applied at the cluster level, the kubelet config at the Karpenter provisioner level is an ideal place to express these IP assignment constraints.
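
For illustration, a minimal sketch of the kind of per-provisioner kubelet settings described above, assuming the Karpenter v1beta1 NodePool API (the NodePool name and nodeClassRef here are hypothetical, and field locations differ in other Karpenter versions):

apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: gpu-small                # hypothetical name
spec:
  template:
    spec:
      nodeClassRef:
        name: default            # hypothetical EC2NodeClass
      kubelet:
        podsPerCore: 4
        maxPods: 20
      requirements:
        - key: node.kubernetes.io/instance-type
          operator: In
          values: ["g5.xlarge", "g5.2xlarge", "g5.4xlarge"]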

Versions:

  • Chart Version: v0.34.0
  • Kubernetes Version (kubectl version): 1.25

Moving this issue to the VPC CNI repository...

@qibint which VPC CNI addon configuration variables are you setting (warm/minimum IP targets, prefix delegation, etc)?

The VPC CNI recently implemented #2745, which seeks to solve the issue you brought up here. That PR will ship in VPC CNI v1.16.3, which is set to release soon.

Even with that PR, though, the IP allocation is not perfect, as we can over-provision depending on the mode and the amount of pod churn. I can answer more once I know your configuration setup.

Looks like I cannot move this issue, so it is staying here

Hi Jeffrey. Thanks for reaching out! Here is my current VPC CNI addon config:

vpc-cni = {
  addon_version = "v1.16.0-eksbuild.1"
  configuration_values = {
    env = {
      AWS_VPC_K8S_CNI_LOGLEVEL     = "INFO"
      AWS_VPC_K8S_PLUGIN_LOG_LEVEL = "INFO"
      WARM_ENI_TARGET              = "0"  # do not keep a spare ENI warm
      WARM_IP_TARGET               = "2"  # keep 2 unassigned IPs free per node
      MINIMUM_IP_TARGET            = "18" # allocate at least 18 IPs per node up front
    }
  }
}

With the above config, when the k8s job requests are high, larger instances are often blocked from attaching more IPs/ENIs, which I suspect is due to EC2 API call throttling. Most of our pods hang in the ContainerCreating step with the message:
Warning FailedCreatePodSandBox 83s kubelet Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "c8c5c0648cfd4a668d1e869ee0f652436b73ed38e17509f6dda1a19cbeb02878": plugin type="aws-cni" name="aws-cni" failed (add): add cmd: failed to assign an IP address to container
Pods are scheduled onto those large nodes but can't get more IPs; we have to wait until some of those 18 IPs are released, which slows down our workflow.
On the upside, this leaves more IPs available within our subnets for other jobs.

Without the above config (i.e., using the default VPC CNI settings), every node chosen by Karpenter, large or small, over-provisions IPs, which can easily cause an IP shortage within our subnets.

I tried to control pod density per Karpenter provisioner; however, as mentioned above, IP allocation doesn't respect my density setting.

So my ideal case is to go back to the default VPC CNI settings and have them be "smart" enough to base IP assignment on the pod density configuration. That way, we could have a per-Karpenter-provisioner setting, case by case.

@qibint if EC2 API throttling is happening, you should be able to see it in your CloudWatch console or in the IPAMD logs on the node. Regardless, if you think you are close to the threshold, you can request a limit increase: https://docs.aws.amazon.com/AWSEC2/latest/APIReference/throttling.html#throttling-increase

The default VPC CNI settings allocate IPs one ENI at a time, which should result in fewer EC2 API calls than a small WARM_IP_TARGET like 2, albeit at the risk of more over-provisioning.
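
For comparison, a rough sketch of the default-style warm pool using the same env keys, shown here as YAML purely for illustration (an assumption-labeled example of the behavior, not an exact drop-in for your Terraform block):

env:
  WARM_ENI_TARGET: "1"   # the default: keep one spare ENI's worth of IPs attached
  # with WARM_IP_TARGET / MINIMUM_IP_TARGET unset, IPs are attached a full ENI at a time,
  # which means fewer EC2 API calls but more unused IPs held per node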

Subnets running out of IPs can definitely cause issues: the node has not reached max pods, but it cannot allocate IPs to satisfy the scheduler placing pods on it. #2745 will help with over-provisioning, and #2714 is the long-term solution to resolve this.

For your case, the best short-term solution seems to be to increase the size of your subnets and to upgrade to v1.16.3 when it is released (targeting early next week), as it contains #2745.
And then long-term, #2714, which is targeting a VPC CNI release sometime this quarter, will make it so that you do not have to worry about the node subnets running out of IPs, as we will pull from other tagged subnets.

Thanks for this effort! Glad to see the short-term solution rolling out soon; we have already increased our subnets to handle more requests in our case.
Looking forward to the long-term solution too, as it will make our life much easier!
Thanks!

There are a lot of planned changes in this area to improve the customer experience, mostly around removing the need to ever touch or worry about the concept of warm IPs. Customers should not have to think about these things, so we gotta improve here.

Can't agree more!

This issue is now closed. Comments on closed issues are hard for our team to see.
If you need more assistance, please either tag a team member or open a new issue that references this one.