Karpenter overestimates memory capacity of certain node types
Description
Observed Behavior:
Karpenter is overestimating the memory capacity of certain node types. When this happens, pods with a certain range of memory requests can trigger Karpenter scale-ups of nodes with insufficient memory for that pending pod to be scheduled. Observing that the pending pod isn't getting scheduled on the newly started node, Karpenter repeatedly attempts to scale up similar nodes with the same result.
In addition to preventing pods from scheduling, this issue has caused us to incur additional costs from third-party integrations that charge by node count, as the repeated erroneous scale ups impact node count metrics used in billing.
In our case, we noticed this with `c6g.medium` instances running Bottlerocket (using the AWS-provided AMI without modification). It's possible that Karpenter overestimates the capacity of other instance types and distributions as well, but we have not confirmed this independently. We've also not yet compared the capacity values of c6g nodes running AL2 vs. Bottlerocket.
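For context, our understanding is that Karpenter estimates a node's memory before launch from the EC2-advertised instance memory minus a flat `VM_MEMORY_OVERHEAD_PERCENT` haircut, rather than from the booted node's actual capacity. The sketch below only illustrates that kind of estimate; the function name, the 2048 MiB input, and the rounding are our assumptions, not Karpenter's actual code:

```go
package main

import (
	"fmt"
	"math"

	"k8s.io/apimachinery/pkg/api/resource"
)

// estimateMemoryCapacity is a simplified illustration (not Karpenter's actual
// implementation): take the EC2-advertised memory in MiB and subtract a flat
// VM_MEMORY_OVERHEAD_PERCENT fraction to guess what the node will report.
func estimateMemoryCapacity(advertisedMiB int64, overheadPercent float64) resource.Quantity {
	overheadMiB := int64(math.Ceil(float64(advertisedMiB) * overheadPercent))
	return resource.MustParse(fmt.Sprintf("%dMi", advertisedMiB-overheadMiB))
}

func main() {
	// Hypothetical numbers: a ~2 GiB instance with the default 7.5% overhead.
	estimated := estimateMemoryCapacity(2048, 0.075)
	fmt.Println("estimated capacity:", estimated.String()) // "1894Mi" for these inputs
}
```

If the real kubelet-reported capacity ends up below such an estimate, the launched node is too small for the pod that triggered it, which matches what we observed.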
Expected Behavior:
- Karpenter should never overestimate the capacity/allocatable of a node (using the default value of `VM_MEMORY_OVERHEAD_PERCENT`, at least across all unmodified AWS-provided non-custom AMI families).
- If this type of situation does occur, Karpenter should not continuously provision new nodes.
We are aware that this risk is called out in the troubleshooting guide:
> A `VM_MEMORY_OVERHEAD_PERCENT` which results in Karpenter overestimating the memory available on a node can result in Karpenter launching nodes which are too small for your workload. In the worst case, this can result in an instance launch loop and your workload remaining unschedulable indefinitely.
But I think the default should be suitable for all of the AWS-supported non-custom AMI families across instance types and sizes. If this isn't feasible, then perhaps this value should not be a global setting, and should vary by AMI family and instance type/size.
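To illustrate the per-AMI-family/per-size idea, here is a minimal sketch of what an overhead lookup with a global fallback could look like. Every name and value in it is invented purely for illustration and is not a proposal for a specific Karpenter API:

```go
package main

import "fmt"

// overheadKey and overheadByFamily are hypothetical; they only illustrate how a
// per-AMI-family / per-instance-size overhead could replace the single global
// VM_MEMORY_OVERHEAD_PERCENT setting.
type overheadKey struct {
	AMIFamily    string // e.g. "Bottlerocket", "AL2"
	InstanceSize string // e.g. "medium", "2xlarge"
}

var overheadByFamily = map[overheadKey]float64{
	{AMIFamily: "Bottlerocket", InstanceSize: "medium"}: 0.11, // invented value
}

// overheadFor falls back to the global default when no override exists.
func overheadFor(key overheadKey, globalDefault float64) float64 {
	if v, ok := overheadByFamily[key]; ok {
		return v
	}
	return globalDefault
}

func main() {
	fmt.Println(overheadFor(overheadKey{"Bottlerocket", "medium"}, 0.075)) // 0.11
	fmt.Println(overheadFor(overheadKey{"AL2", "medium"}, 0.075))          // 0.075
}
```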
Reproduction Steps (Please include YAML):
The following steps do not reproduce the scale-up loop itself, but they do demonstrate the underlying capacity overestimate:
- Create an `EC2NodeClass` and `NodePool` with `c6g.medium` Bottlerocket instances
- Trigger a scale-up of this NodePool
- Compare the capacity and allocatable values of the `NodeClaim` vs. the `Node`, noting that the `NodeClaim` has larger memory capacity/allocatable values than the `Node` object
Example from our case:
NodeClaim:

```yaml
status:
  allocatable:
    cpu: 940m
    ephemeral-storage: 89Gi
    memory: 1392Mi
    pods: "8"
    vpc.amazonaws.com/pod-eni: "4"
  capacity:
    cpu: "1"
    ephemeral-storage: 100Gi
    memory: 1835Mi
    pods: "8"
    vpc.amazonaws.com/pod-eni: "4"
```
Node:

```yaml
allocatable:
  cpu: 940m
  ephemeral-storage: "95500736762"
  hugepages-1Gi: "0"
  hugepages-2Mi: "0"
  hugepages-32Mi: "0"
  hugepages-64Ki: "0"
  memory: 1419032Ki
  pods: "8"
capacity:
  cpu: "1"
  ephemeral-storage: 102334Mi
  hugepages-1Gi: "0"
  hugepages-2Mi: "0"
  hugepages-32Mi: "0"
  hugepages-64Ki: "0"
  memory: 1872664Ki
  pods: "8"
```
Note that 1879040Ki [1835Mi] (NodeClaim) > 1872664Ki (Node).
The default value of `VM_MEMORY_OVERHEAD_PERCENT` (0.075) is in use for this example.
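For clarity, the comparison can be checked directly with Kubernetes `resource.Quantity` arithmetic (the values are copied verbatim from the objects above):

```go
package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/api/resource"
)

func main() {
	// Values copied from the NodeClaim and Node shown above.
	nodeClaimMem := resource.MustParse("1835Mi")  // NodeClaim capacity.memory
	nodeMem := resource.MustParse("1872664Ki")    // Node capacity.memory

	fmt.Println("NodeClaim Ki:", nodeClaimMem.Value()/1024)       // 1879040
	fmt.Println("Node Ki:", nodeMem.Value()/1024)                 // 1872664
	fmt.Println("overestimated:", nodeClaimMem.Cmp(nodeMem) > 0)  // true
}
```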
Versions:
- Karpenter Version: 1.0.1
- Kubernetes Version (`kubectl version`): 1.28, 1.30