awslabs/amazon-eks-ami

bug(al2): max pods does not account for pod security groups VPC CNI feature

What happened:

I have Security Groups for Pods enabled on my EKS cluster, which I set up by following the AWS guide here.

I upgraded the AWS CNI plugin on my EKS cluster from v1.16.2 to v1.18.6 and started seeing the issue. I eventually downgraded to v1.16.4, which is the first release where the issue appears.

Once I upgraded the AWS CNI plugin, some pods became stuck in ContainerCreating, with logs like this:

failed (add): add cmd: failed to assign an IP address to container

(Nodes have been rotated various times while testing and debugging)

After debugging and comparing with other EKS clusters running AWS CNI plugin v1.16.2, I noticed that v1.16.4 introduced a change that prevents pods from being assigned to the "trunk ENI".
Because the "trunk ENI" is no longer used for these pods, the effective "max pods" per instance type differs from the default value.

For an m6i.large node, the EKS AMI calculates a max-pods value of 29, which works fine if "Security Groups for EKS Pods" is not enabled.
Once the SG feature is enabled (as per the AWS guide linked above), the max-pods value needs to be reduced, because 1 ENI is now used as the "trunk interface" and cannot have IPs for non-SG pods assigned to it.
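
The arithmetic behind these numbers can be sketched roughly as follows. This assumes the commonly documented max-pods formula for secondary-IP mode (no prefix delegation), ENIs * (IPv4 addresses per ENI - 1) + 2, and that m6i.large supports 3 ENIs with 10 IPv4 addresses each. The function name and the `reserved_enis` parameter are illustrative and are not part of the AMI's scripts.

```python
def max_pods(enis: int, ips_per_eni: int, reserved_enis: int = 0) -> int:
    """Estimate max pods in secondary-IP mode (no prefix delegation).

    Each usable ENI contributes (ips_per_eni - 1) pod IPs, since the
    primary IP of an ENI is not handed out to pods. The +2 is the
    usual allowance for host-network pods (assumption: aws-node and
    kube-proxy).
    """
    usable = enis - reserved_enis
    if usable < 1:
        raise ValueError("at least one ENI must remain usable for pod IPs")
    return usable * (ips_per_eni - 1) + 2


# m6i.large, assumed limits: 3 ENIs, 10 IPv4 addresses per ENI.
default_value = max_pods(3, 10)                   # what the AMI sets today
with_trunk = max_pods(3, 10, reserved_enis=1)     # one ENI reserved as trunk

print(default_value, with_trunk)  # 29 20
```

Under these assumptions, a node that advertises 29 pods can only address about 20 non-SG pods once one ENI is reserved as the trunk interface, which would match the symptom of pods failing IP assignment before the node reports being full.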

I can see the max-pods value in the node's status.allocatable in K8s:

kubectl --kubeconfig /tmp/${CLUSTER_NAME}.yml get nodes ip-10-137-80-145.eu-west-1.compute.internal -o json | jq .status.allocatable
{
  "cpu": "1930m",
  "ephemeral-storage": "95551679124",
  "hugepages-1Gi": "0",
  "hugepages-2Mi": "0",
  "memory": "7291784Ki",
  "pods": "29",
  "vpc.amazonaws.com/pod-eni": "9"
}
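
A quick way to flag affected nodes from this output might look like the sketch below. It treats a non-zero `vpc.amazonaws.com/pod-eni` value as a sign that a trunk ENI is attached, and it takes the instance's ENI and per-ENI IP limits as inputs (3 and 10 are assumed for m6i.large). The helper name `pods_overcommitted` is hypothetical.

```python
import json


def pods_overcommitted(allocatable: dict, enis: int, ips_per_eni: int) -> bool:
    """Return True if the advertised pod count exceeds what the
    non-trunk ENIs can address (secondary-IP mode assumed)."""
    # Assumption: a non-zero pod-eni resource means one ENI is a trunk.
    has_trunk = int(allocatable.get("vpc.amazonaws.com/pod-eni", "0")) > 0
    usable_enis = enis - 1 if has_trunk else enis
    addressable = usable_enis * (ips_per_eni - 1) + 2
    return int(allocatable["pods"]) > addressable


# Relevant fields from the node output above.
allocatable = json.loads('{"pods": "29", "vpc.amazonaws.com/pod-eni": "9"}')
print(pods_overcommitted(allocatable, enis=3, ips_per_eni=10))  # True
```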

What you expected to happen:

I expected the AWS EKS AMI to calculate the correct "max-pods" value for a node once "SG for EKS Pods" is enabled.

How to reproduce it (as minimally and precisely as possible):

  • Create an EKS cluster with k8s version 1.30 (I don't think k8s version matters)
  • Use latest EKS-AMI
  • Enable "SG for EKS Pods" as per AWS Guide
  • Use AWS CNI plugin version v1.16.4 or later
  • Scale up the cluster with pods that do not use SG and fill up the nodes
  • Some pods should be stuck in ContainerCreating once the node is almost full in terms of "Maximum pods" (not CPU or memory)
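
To confirm the last step, the stuck pods can be picked out of the pod list. The sketch below assumes input shaped like the output of `kubectl get pods --all-namespaces -o json`. The helper name and the sample pod names are made up for illustration.

```python
def stuck_creating(pod_list: dict) -> list:
    """Return "namespace/name" for pods that have at least one
    container waiting with reason ContainerCreating."""
    stuck = []
    for pod in pod_list.get("items", []):
        statuses = pod.get("status", {}).get("containerStatuses", [])
        for cs in statuses:
            waiting = cs.get("state", {}).get("waiting") or {}
            if waiting.get("reason") == "ContainerCreating":
                meta = pod["metadata"]
                stuck.append(f'{meta["namespace"]}/{meta["name"]}')
                break  # one waiting container is enough to report the pod
    return stuck


# Hypothetical sample with one running pod and one stuck pod.
sample = {
    "items": [
        {
            "metadata": {"namespace": "default", "name": "app-a"},
            "status": {"containerStatuses": [{"state": {"running": {}}}]},
        },
        {
            "metadata": {"namespace": "default", "name": "app-b"},
            "status": {
                "containerStatuses": [
                    {"state": {"waiting": {"reason": "ContainerCreating"}}}
                ]
            },
        },
    ]
}
print(stuck_creating(sample))  # ['default/app-b']
```

With real data, the dictionary could be loaded with `json.load(sys.stdin)` from the piped kubectl output instead of the inline sample.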

Environment:

  • AWS Region: eu-west-1
  • Instance Type(s): m6i.large
  • Cluster Kubernetes version: 1.30
  • Node Kubernetes version: v1.30.4-eks-a737599
  • AMI Version: ami-008d7732840c48377 - amazon-eks-node-1.30-v20241109

Notes

  • This bug is also described here, but that issue was closed by the author. The other difference is that I'm not using Bottlerocket; I'm using the EKS AMI.
  • I'm not entirely sure if this bug belongs in this repo or in amazon-vpc-cni-k8s, but I have a hunch it belongs here.
  • This has nothing to do with subnet size. The subnets have plenty of available IPs.
  • The bug can be triggered by using AWS CNI plugin v1.16.4 or later and fixed by downgrading to v1.16.2 (rotating nodes when switching versions).
  • The AWS CNI plugin is managed as an EKS add-on.
  • See below the full spec for the aws-node DaemonSet:
apiVersion: apps/v1
kind: DaemonSet
metadata:
  annotations:
    deprecated.daemonset.template.generation: "10"
  creationTimestamp: "2023-09-06T14:42:57Z"
  generation: 10
  labels:
    app.kubernetes.io/instance: aws-vpc-cni
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: aws-node
    app.kubernetes.io/version: v1.16.4
    helm.sh/chart: aws-vpc-cni-1.16.4
    k8s-app: aws-node
  name: aws-node
  namespace: kube-system
  resourceVersion: "193970765"
  uid: d09631df-cc22-4dbe-9a57-fb6ccda8e4d0
spec:
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      k8s-app: aws-node
  template:
    metadata:
      creationTimestamp: null
      labels:
        app.kubernetes.io/instance: aws-vpc-cni
        app.kubernetes.io/name: aws-node
        k8s-app: aws-node
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: kubernetes.io/os
                operator: In
                values:
                - linux
              - key: kubernetes.io/arch
                operator: In
                values:
                - amd64
                - arm64
              - key: eks.amazonaws.com/compute-type
                operator: NotIn
                values:
                - fargate
      containers:
      - env:
        - name: ADDITIONAL_ENI_TAGS
          value: '{}'
        - name: ANNOTATE_POD_IP
          value: "false"
        - name: AWS_VPC_CNI_NODE_PORT_SUPPORT
          value: "true"
        - name: AWS_VPC_ENI_MTU
          value: "9001"
        - name: AWS_VPC_K8S_CNI_CONFIGURE_RPFILTER
          value: "false"
        - name: AWS_VPC_K8S_CNI_CUSTOM_NETWORK_CFG
          value: "false"
        - name: AWS_VPC_K8S_CNI_EXTERNALSNAT
          value: "false"
        - name: AWS_VPC_K8S_CNI_LOGLEVEL
          value: DEBUG
        - name: AWS_VPC_K8S_CNI_LOG_FILE
          value: /host/var/log/aws-routed-eni/ipamd.log
        - name: AWS_VPC_K8S_CNI_RANDOMIZESNAT
          value: prng
        - name: AWS_VPC_K8S_CNI_VETHPREFIX
          value: eni
        - name: AWS_VPC_K8S_PLUGIN_LOG_FILE
          value: /var/log/aws-routed-eni/plugin.log
        - name: AWS_VPC_K8S_PLUGIN_LOG_LEVEL
          value: DEBUG
        - name: CLUSTER_NAME
          value: pgw-pre-1
        - name: DISABLE_INTROSPECTION
          value: "false"
        - name: DISABLE_METRICS
          value: "false"
        - name: DISABLE_NETWORK_RESOURCE_PROVISIONING
          value: "false"
        - name: ENABLE_IPv4
          value: "true"
        - name: ENABLE_IPv6
          value: "false"
        - name: ENABLE_POD_ENI
          value: "true"
        - name: ENABLE_PREFIX_DELEGATION
          value: "false"
        - name: VPC_CNI_VERSION
          value: v1.16.4
        - name: VPC_ID
          value: vpc-05399c0afa539058e
        - name: WARM_ENI_TARGET
          value: "1"
        - name: WARM_PREFIX_TARGET
          value: "1"
        - name: MY_NODE_NAME
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: spec.nodeName
        - name: MY_POD_NAME
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.name
        image: 602401143452.dkr.ecr.eu-west-1.amazonaws.com/amazon-k8s-cni:v1.16.4-eksbuild.2
        imagePullPolicy: IfNotPresent
        livenessProbe:
          exec:
            command:
            - /app/grpc-health-probe
            - -addr=:50051
            - -connect-timeout=5s
            - -rpc-timeout=5s
          failureThreshold: 3
          initialDelaySeconds: 60
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 10
        name: aws-node
        ports:
        - containerPort: 61678
          hostPort: 61678
          name: metrics
          protocol: TCP
        readinessProbe:
          exec:
            command:
            - /app/grpc-health-probe
            - -addr=:50051
            - -connect-timeout=5s
            - -rpc-timeout=5s
          failureThreshold: 3
          initialDelaySeconds: 1
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 10
        resources:
          requests:
            cpu: 25m
        securityContext:
          capabilities:
            add:
            - NET_ADMIN
            - NET_RAW
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /host/opt/cni/bin
          name: cni-bin-dir
        - mountPath: /host/etc/cni/net.d
          name: cni-net-dir
        - mountPath: /host/var/log/aws-routed-eni
          name: log-dir
        - mountPath: /var/run/dockershim.sock
          name: dockershim
        - mountPath: /var/run/aws-node
          name: run-dir
        - mountPath: /run/xtables.lock
          name: xtables-lock
      - args:
        - --enable-ipv6=false
        - --enable-network-policy=false
        - --enable-cloudwatch-logs=false
        - --enable-policy-event-logs=false
        - --metrics-bind-addr=:8162
        - --health-probe-bind-addr=:8163
        - --conntrack-cache-cleanup-period=300
        env:
        - name: MY_NODE_NAME
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: spec.nodeName
        image: 602401143452.dkr.ecr.eu-west-1.amazonaws.com/amazon/aws-network-policy-agent:v1.0.8-eksbuild.1
        imagePullPolicy: IfNotPresent
        name: aws-eks-nodeagent
        resources:
          requests:
            cpu: 25m
        securityContext:
          capabilities:
            add:
            - NET_ADMIN
          privileged: true
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /host/opt/cni/bin
          name: cni-bin-dir
        - mountPath: /sys/fs/bpf
          name: bpf-pin-path
        - mountPath: /var/log/aws-routed-eni
          name: log-dir
        - mountPath: /var/run/aws-node
          name: run-dir
      dnsPolicy: ClusterFirst
      hostNetwork: true
      initContainers:
      - env:
        - name: DISABLE_TCP_EARLY_DEMUX
          value: "true"
        - name: ENABLE_IPv6
          value: "false"
        image: 602401143452.dkr.ecr.eu-west-1.amazonaws.com/amazon-k8s-cni-init:v1.16.4-eksbuild.2
        imagePullPolicy: IfNotPresent
        name: aws-vpc-cni-init
        resources:
          requests:
            cpu: 25m
        securityContext:
          privileged: true
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /host/opt/cni/bin
          name: cni-bin-dir
      priorityClassName: system-node-critical
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      serviceAccount: aws-node
      serviceAccountName: aws-node
      terminationGracePeriodSeconds: 10
      tolerations:
      - operator: Exists
      volumes:
      - hostPath:
          path: /sys/fs/bpf
          type: ""
        name: bpf-pin-path
      - hostPath:
          path: /opt/cni/bin
          type: ""
        name: cni-bin-dir
      - hostPath:
          path: /etc/cni/net.d
          type: ""
        name: cni-net-dir
      - hostPath:
          path: /var/run/dockershim.sock
          type: ""
        name: dockershim
      - hostPath:
          path: /var/log/aws-routed-eni
          type: DirectoryOrCreate
        name: log-dir
      - hostPath:
          path: /var/run/aws-node
          type: DirectoryOrCreate
        name: run-dir
      - hostPath:
          path: /run/xtables.lock
          type: ""
        name: xtables-lock
  updateStrategy:
    rollingUpdate:
      maxSurge: 0
      maxUnavailable: 10%
    type: RollingUpdate
status:
  currentNumberScheduled: 2
  desiredNumberScheduled: 2
  numberAvailable: 2
  numberMisscheduled: 0
  numberReady: 2
  observedGeneration: 10
  updatedNumberScheduled: 2

+1, we're seeing this same issue on our cluster nodes (2xlarge) that have Per Pod SG enabled and are at max pods (~40 pods).