Karpenter overestimates memory capacity of certain node types
Description
Observed Behavior:
Karpenter is overestimating the memory capacity of certain node types. When this happens, pods with a certain range of memory requests can trigger Karpenter scale-ups of nodes with insufficient memory for that pending pod to be scheduled. Observing that the pending pod isn't getting scheduled on the newly started node, Karpenter repeatedly attempts to scale up similar nodes with the same result.
In addition to preventing pods from scheduling, this issue has caused us to incur additional costs from third-party integrations that charge by node count, as the repeated erroneous scale ups impact node count metrics used in billing.
In our case, we noticed this with `c6g.medium` instances running Bottlerocket (using the AWS-provided AMI without modification). It's possible that Karpenter overestimates the capacity of other instance types and distributions as well, but we have not confirmed this independently. We've also not yet compared the capacity values of c6g nodes running AL2 vs. Bottlerocket.
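For context, our understanding is that Karpenter estimates a node's memory before launch from the EC2-advertised instance memory minus a flat `VM_MEMORY_OVERHEAD_PERCENT` haircut, rather than from the booted node's actual capacity. The sketch below only illustrates that kind of estimate; the function name, the 2048 MiB input, and the rounding are our assumptions, not Karpenter's actual code:

```go
package main

import (
	"fmt"
	"math"

	"k8s.io/apimachinery/pkg/api/resource"
)

// estimateMemoryCapacity is a simplified illustration (not Karpenter's actual
// implementation): take the EC2-advertised memory in MiB and subtract a flat
// VM_MEMORY_OVERHEAD_PERCENT fraction to guess what the node will report.
func estimateMemoryCapacity(advertisedMiB int64, overheadPercent float64) resource.Quantity {
	overheadMiB := int64(math.Ceil(float64(advertisedMiB) * overheadPercent))
	return resource.MustParse(fmt.Sprintf("%dMi", advertisedMiB-overheadMiB))
}

func main() {
	// Hypothetical numbers: a ~2 GiB instance with the default 7.5% overhead.
	estimated := estimateMemoryCapacity(2048, 0.075)
	fmt.Println("estimated capacity:", estimated.String()) // "1894Mi" for these inputs
}
```

If the real kubelet-reported capacity ends up below such an estimate, the launched node is too small for the pod that triggered it, which matches what we observed.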
Expected Behavior:
- Karpenter should never overestimate the capacity/allocatable of a node (using the default value of `VM_MEMORY_OVERHEAD_PERCENT`, at least across all unmodified AWS-provided non-custom AMI families).
- If this type of situation does occur, Karpenter should not continuously provision new nodes.
We are aware that this risk is called out in the troubleshooting guide:
> A `VM_MEMORY_OVERHEAD_PERCENT` which results in Karpenter overestimating the memory available on a node can result in Karpenter launching nodes which are too small for your workload. In the worst case, this can result in an instance launch loop and your workload remaining unschedulable indefinitely.
But I think the default should be suitable for all of the AWS-supported non-custom AMI families across instance types and sizes. If this isn't feasible, then perhaps this value should not be a global setting, and should vary by AMI family and instance type/size.
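To illustrate the per-AMI-family/per-size idea, here is a minimal sketch of what an overhead lookup with a global fallback could look like. Every name and value in it is invented purely for illustration and is not a proposal for a specific Karpenter API:

```go
package main

import "fmt"

// overheadKey and overheadByFamily are hypothetical; they only illustrate how a
// per-AMI-family / per-instance-size overhead could replace the single global
// VM_MEMORY_OVERHEAD_PERCENT setting.
type overheadKey struct {
	AMIFamily    string // e.g. "Bottlerocket", "AL2"
	InstanceSize string // e.g. "medium", "2xlarge"
}

var overheadByFamily = map[overheadKey]float64{
	{AMIFamily: "Bottlerocket", InstanceSize: "medium"}: 0.11, // invented value
}

// overheadFor falls back to the global default when no override exists.
func overheadFor(key overheadKey, globalDefault float64) float64 {
	if v, ok := overheadByFamily[key]; ok {
		return v
	}
	return globalDefault
}

func main() {
	fmt.Println(overheadFor(overheadKey{"Bottlerocket", "medium"}, 0.075)) // 0.11
	fmt.Println(overheadFor(overheadKey{"AL2", "medium"}, 0.075))          // 0.075
}
```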
Reproduction Steps (Please include YAML):
The following steps do not reproduce the scale-up loop itself, but they do demonstrate the underlying capacity overestimate:
- Create an `EC2NodeClass` and `NodePool` with `c6g.medium` Bottlerocket instances
- Trigger a scale-up of this NodePool
- Compare the capacity and allocatable values of the `NodeClaim` vs. the `Node`, noting that the `NodeClaim` has larger memory capacity/allocatable values than the `Node` object
Example from our case:
NodeClaim:

```yaml
status:
  allocatable:
    cpu: 940m
    ephemeral-storage: 89Gi
    memory: 1392Mi
    pods: "8"
    vpc.amazonaws.com/pod-eni: "4"
  capacity:
    cpu: "1"
    ephemeral-storage: 100Gi
    memory: 1835Mi
    pods: "8"
    vpc.amazonaws.com/pod-eni: "4"
```
Node:

```yaml
allocatable:
  cpu: 940m
  ephemeral-storage: "95500736762"
  hugepages-1Gi: "0"
  hugepages-2Mi: "0"
  hugepages-32Mi: "0"
  hugepages-64Ki: "0"
  memory: 1419032Ki
  pods: "8"
capacity:
  cpu: "1"
  ephemeral-storage: 102334Mi
  hugepages-1Gi: "0"
  hugepages-2Mi: "0"
  hugepages-32Mi: "0"
  hugepages-64Ki: "0"
  memory: 1872664Ki
  pods: "8"
```
Note that 1879040Ki [1835Mi] (NodeClaim) > 1872664Ki (Node).
The default value of `VM_MEMORY_OVERHEAD_PERCENT` (0.075) is in use for this example.
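For clarity, the comparison can be checked directly with Kubernetes `resource.Quantity` arithmetic (the values are copied verbatim from the objects above):

```go
package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/api/resource"
)

func main() {
	// Values copied from the NodeClaim and Node shown above.
	nodeClaimMem := resource.MustParse("1835Mi")  // NodeClaim capacity.memory
	nodeMem := resource.MustParse("1872664Ki")    // Node capacity.memory

	fmt.Println("NodeClaim Ki:", nodeClaimMem.Value()/1024)       // 1879040
	fmt.Println("Node Ki:", nodeMem.Value()/1024)                 // 1872664
	fmt.Println("overestimated:", nodeClaimMem.Cmp(nodeMem) > 0)  // true
}
```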
Versions:
- Karpenter Version: 1.0.1
- Kubernetes Version (`kubectl version`): 1.28, 1.30