bug(al2): max pods does not account for pod security groups VPC CNI feature
What happened:
I have Security Groups for Pods enabled on my EKS cluster, set up by following the AWS guide linked here.
I upgraded the AWS VPC CNI plugin of my EKS cluster from v1.16.2 to v1.18.6 and started seeing the issue. Bisecting downwards, I eventually found that v1.16.4 is the first release where the issue appears.
After the upgrade, some pods got stuck in ContainerCreating, with logs like this:
failed (add): add cmd: failed to assign an IP address to container
(Nodes have been rotated various times while testing and debugging)
After debugging and comparing with other EKS clusters running AWS VPC CNI plugin v1.16.2, I noticed that v1.16.4 introduced a change that prevents pods from being assigned IPs on the "trunk ENI".
Because the "trunk ENI" is no longer usable for regular pods, the default "max pods" calculation per instance type no longer holds. For an m6i.large node, the EKS AMI calculates a max-pods of 29, which works fine when "Security Groups for Pods" is not enabled. Once the SG feature is enabled (as per the AWS guide linked above), the max-pods calculation needs to be reduced: one ENI is now used as the trunk interface and cannot have IPs for non-SG pods assigned to it.
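To make the arithmetic concrete, here is a sketch of the standard EKS max-pods formula and how reserving one ENI as the trunk changes it. The m6i.large ENI limits (3 ENIs, 10 IPv4 addresses each) come from the AWS ENI limits table; the adjusted calculation reflects my understanding of the bug, not an official AWS formula:

```python
# Standard EKS max-pods formula: each ENI contributes (IPv4 addresses - 1)
# pod IPs (one address per ENI is the primary), plus 2 for host-network pods.
def max_pods(enis: int, ipv4_per_eni: int) -> int:
    return enis * (ipv4_per_eni - 1) + 2

# m6i.large: 3 ENIs, 10 IPv4 addresses per ENI
print(max_pods(3, 10))      # 29 -- the value the EKS AMI advertises

# With Security Groups for Pods enabled, one ENI becomes the trunk ENI
# and can no longer hold secondary IPs for non-SG pods:
print(max_pods(3 - 1, 10))  # 20 -- what max-pods arguably should be
```

The gap between 29 and 20 is exactly the set of pods that the kubelet admits but ipamd can no longer find an IP for.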
I can see the max-pods value in the node's status.allocatable in K8s:
kubectl --kubeconfig /tmp/${CLUSTER_NAME}.yml get nodes ip-10-137-80-145.eu-west-1.compute.internal -o json | jq .status.allocatable
{
"cpu": "1930m",
"ephemeral-storage": "95551679124",
"hugepages-1Gi": "0",
"hugepages-2Mi": "0",
"memory": "7291784Ki",
"pods": "29",
"vpc.amazonaws.com/pod-eni": "9"
}
What you expected to happen:
I expected the AWS EKS AMI to correctly calculate the max-pods value for the node once "SG for EKS Pods" is enabled.
How to reproduce it (as minimally and precisely as possible):
- Create an EKS cluster with k8s version 1.30 (I don't think k8s version matters)
- Use latest EKS-AMI
- Enable "SG for EKS Pods" as per AWS Guide
- Use AWS CNI plugin version v1.16.4 or later
- Scale up the cluster with pods that do not use SG and fill up the nodes
- Some pods should get stuck in ContainerCreating once the node is almost full in terms of "Maximum pods" (not CPU or memory)
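Assuming the trunk-ENI explanation above is right, the repro comes down to a mismatch between what the kubelet admits and what ipamd can actually serve. The numbers below are for m6i.large and are an illustration, not measured data:

```python
# The kubelet admits pods up to the advertised max-pods...
advertised_max_pods = 29            # node allocatable "pods" on m6i.large

# ...but with the trunk ENI reserved, ipamd can only assign pod IPs from
# the remaining 2 ENIs (9 secondary IPs each), plus 2 host-network slots.
assignable_pods = 2 * (10 - 1) + 2  # 20

# Pods admitted beyond that count get stuck in ContainerCreating.
print(advertised_max_pods - assignable_pods)  # 9
```

So on a nearly-full node, up to 9 pods can be scheduled but never receive an IP, which matches the "failed to assign an IP address to container" errors above.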
Environment:
- AWS Region:
eu-west-1
- Instance Type(s):
m6i.large
- Cluster Kubernetes version:
1.30
- Node Kubernetes version:
v1.30.4-eks-a737599
- AMI Version:
ami-008d7732840c48377 (amazon-eks-node-1.30-v20241109)
Notes
- This bug is also described here, but that issue was closed by the author. Another difference is that I'm not using Bottlerocket; I'm using the EKS AMI.
- I'm not entirely sure if this bug belongs in this repo or in amazon-vpc-cni-k8s - but I have a hunch it belongs here.
- This has nothing to do with subnet size; the subnets have plenty of available IPs
- The bug can be triggered by using AWS CNI plugin v1.16.4 or later, and fixed by using v1.16.2 (rotating nodes when switching versions)
- The AWS CNI plugin is managed as an EKS add-on
- See below the full spec of the aws-node DaemonSet:
apiVersion: apps/v1
kind: DaemonSet
metadata:
annotations:
deprecated.daemonset.template.generation: "10"
creationTimestamp: "2023-09-06T14:42:57Z"
generation: 10
labels:
app.kubernetes.io/instance: aws-vpc-cni
app.kubernetes.io/managed-by: Helm
app.kubernetes.io/name: aws-node
app.kubernetes.io/version: v1.16.4
helm.sh/chart: aws-vpc-cni-1.16.4
k8s-app: aws-node
name: aws-node
namespace: kube-system
resourceVersion: "193970765"
uid: d09631df-cc22-4dbe-9a57-fb6ccda8e4d0
spec:
revisionHistoryLimit: 10
selector:
matchLabels:
k8s-app: aws-node
template:
metadata:
creationTimestamp: null
labels:
app.kubernetes.io/instance: aws-vpc-cni
app.kubernetes.io/name: aws-node
k8s-app: aws-node
spec:
affinity:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: kubernetes.io/os
operator: In
values:
- linux
- key: kubernetes.io/arch
operator: In
values:
- amd64
- arm64
- key: eks.amazonaws.com/compute-type
operator: NotIn
values:
- fargate
containers:
- env:
- name: ADDITIONAL_ENI_TAGS
value: '{}'
- name: ANNOTATE_POD_IP
value: "false"
- name: AWS_VPC_CNI_NODE_PORT_SUPPORT
value: "true"
- name: AWS_VPC_ENI_MTU
value: "9001"
- name: AWS_VPC_K8S_CNI_CONFIGURE_RPFILTER
value: "false"
- name: AWS_VPC_K8S_CNI_CUSTOM_NETWORK_CFG
value: "false"
- name: AWS_VPC_K8S_CNI_EXTERNALSNAT
value: "false"
- name: AWS_VPC_K8S_CNI_LOGLEVEL
value: DEBUG
- name: AWS_VPC_K8S_CNI_LOG_FILE
value: /host/var/log/aws-routed-eni/ipamd.log
- name: AWS_VPC_K8S_CNI_RANDOMIZESNAT
value: prng
- name: AWS_VPC_K8S_CNI_VETHPREFIX
value: eni
- name: AWS_VPC_K8S_PLUGIN_LOG_FILE
value: /var/log/aws-routed-eni/plugin.log
- name: AWS_VPC_K8S_PLUGIN_LOG_LEVEL
value: DEBUG
- name: CLUSTER_NAME
value: pgw-pre-1
- name: DISABLE_INTROSPECTION
value: "false"
- name: DISABLE_METRICS
value: "false"
- name: DISABLE_NETWORK_RESOURCE_PROVISIONING
value: "false"
- name: ENABLE_IPv4
value: "true"
- name: ENABLE_IPv6
value: "false"
- name: ENABLE_POD_ENI
value: "true"
- name: ENABLE_PREFIX_DELEGATION
value: "false"
- name: VPC_CNI_VERSION
value: v1.16.4
- name: VPC_ID
value: vpc-05399c0afa539058e
- name: WARM_ENI_TARGET
value: "1"
- name: WARM_PREFIX_TARGET
value: "1"
- name: MY_NODE_NAME
valueFrom:
fieldRef:
apiVersion: v1
fieldPath: spec.nodeName
- name: MY_POD_NAME
valueFrom:
fieldRef:
apiVersion: v1
fieldPath: metadata.name
image: 602401143452.dkr.ecr.eu-west-1.amazonaws.com/amazon-k8s-cni:v1.16.4-eksbuild.2
imagePullPolicy: IfNotPresent
livenessProbe:
exec:
command:
- /app/grpc-health-probe
- -addr=:50051
- -connect-timeout=5s
- -rpc-timeout=5s
failureThreshold: 3
initialDelaySeconds: 60
periodSeconds: 10
successThreshold: 1
timeoutSeconds: 10
name: aws-node
ports:
- containerPort: 61678
hostPort: 61678
name: metrics
protocol: TCP
readinessProbe:
exec:
command:
- /app/grpc-health-probe
- -addr=:50051
- -connect-timeout=5s
- -rpc-timeout=5s
failureThreshold: 3
initialDelaySeconds: 1
periodSeconds: 10
successThreshold: 1
timeoutSeconds: 10
resources:
requests:
cpu: 25m
securityContext:
capabilities:
add:
- NET_ADMIN
- NET_RAW
terminationMessagePath: /dev/termination-log
terminationMessagePolicy: File
volumeMounts:
- mountPath: /host/opt/cni/bin
name: cni-bin-dir
- mountPath: /host/etc/cni/net.d
name: cni-net-dir
- mountPath: /host/var/log/aws-routed-eni
name: log-dir
- mountPath: /var/run/dockershim.sock
name: dockershim
- mountPath: /var/run/aws-node
name: run-dir
- mountPath: /run/xtables.lock
name: xtables-lock
- args:
- --enable-ipv6=false
- --enable-network-policy=false
- --enable-cloudwatch-logs=false
- --enable-policy-event-logs=false
- --metrics-bind-addr=:8162
- --health-probe-bind-addr=:8163
- --conntrack-cache-cleanup-period=300
env:
- name: MY_NODE_NAME
valueFrom:
fieldRef:
apiVersion: v1
fieldPath: spec.nodeName
image: 602401143452.dkr.ecr.eu-west-1.amazonaws.com/amazon/aws-network-policy-agent:v1.0.8-eksbuild.1
imagePullPolicy: IfNotPresent
name: aws-eks-nodeagent
resources:
requests:
cpu: 25m
securityContext:
capabilities:
add:
- NET_ADMIN
privileged: true
terminationMessagePath: /dev/termination-log
terminationMessagePolicy: File
volumeMounts:
- mountPath: /host/opt/cni/bin
name: cni-bin-dir
- mountPath: /sys/fs/bpf
name: bpf-pin-path
- mountPath: /var/log/aws-routed-eni
name: log-dir
- mountPath: /var/run/aws-node
name: run-dir
dnsPolicy: ClusterFirst
hostNetwork: true
initContainers:
- env:
- name: DISABLE_TCP_EARLY_DEMUX
value: "true"
- name: ENABLE_IPv6
value: "false"
image: 602401143452.dkr.ecr.eu-west-1.amazonaws.com/amazon-k8s-cni-init:v1.16.4-eksbuild.2
imagePullPolicy: IfNotPresent
name: aws-vpc-cni-init
resources:
requests:
cpu: 25m
securityContext:
privileged: true
terminationMessagePath: /dev/termination-log
terminationMessagePolicy: File
volumeMounts:
- mountPath: /host/opt/cni/bin
name: cni-bin-dir
priorityClassName: system-node-critical
restartPolicy: Always
schedulerName: default-scheduler
securityContext: {}
serviceAccount: aws-node
serviceAccountName: aws-node
terminationGracePeriodSeconds: 10
tolerations:
- operator: Exists
volumes:
- hostPath:
path: /sys/fs/bpf
type: ""
name: bpf-pin-path
- hostPath:
path: /opt/cni/bin
type: ""
name: cni-bin-dir
- hostPath:
path: /etc/cni/net.d
type: ""
name: cni-net-dir
- hostPath:
path: /var/run/dockershim.sock
type: ""
name: dockershim
- hostPath:
path: /var/log/aws-routed-eni
type: DirectoryOrCreate
name: log-dir
- hostPath:
path: /var/run/aws-node
type: DirectoryOrCreate
name: run-dir
- hostPath:
path: /run/xtables.lock
type: ""
name: xtables-lock
updateStrategy:
rollingUpdate:
maxSurge: 0
maxUnavailable: 10%
type: RollingUpdate
status:
currentNumberScheduled: 2
desiredNumberScheduled: 2
numberAvailable: 2
numberMisscheduled: 0
numberReady: 2
observedGeneration: 10
updatedNumberScheduled: 2
+1, we're seeing the same issue on our clusters' nodes (2xlarge) that have per-pod security groups enabled and run close to max pods (~40 pods).