aws/amazon-vpc-cni-k8s

aws eniconfig not being honored and pods using trunk instead of ENI

sstarcher opened this issue · 24 comments

What happened:
Upgraded from aws-vpc-cni v1.12.6 to v1.16.0. Pods sometimes get assigned to the trunk interface instead of to the ENI, so they do not get the correct security groups from the ENIConfig. From a small sample, the affected pods seem to be those that were assigned to the node just as it was coming up.

Attach logs

snippet of logs remainder sent

aws-node-f5x7z aws-node {"level":"info","ts":"2024-02-13T13:19:12.381Z","caller":"ipamd/ipamd.go:822","msg":"Found ENI Config Name: eni-config-ds-subnet-0318d75ae06a34052"}
aws-node-f5x7z aws-node {"level":"info","ts":"2024-02-13T13:19:12.483Z","caller":"ipamd/ipamd.go:793","msg":"ipamd: using custom network config: [sg-066233d33bbd94a21 sg-03d0fde3a6a691a6d], subnet-0318d75ae06a34052"}
aws-node-f5x7z aws-node {"level":"info","ts":"2024-02-13T13:19:12.483Z","caller":"awsutils/awsutils.go:728","msg":"Trying to allocate 10 IP addresses on new ENI"}
aws-node-f5x7z aws-node {"level":"info","ts":"2024-02-13T13:19:12.483Z","caller":"awsutils/awsutils.go:728","msg":"Using a custom network config for the new ENI"}
aws-node-f5x7z aws-node {"level":"info","ts":"2024-02-13T13:19:12.483Z","caller":"awsutils/awsutils.go:728","msg":"Creating ENI with security groups: [sg-066233d33bbd94a21 sg-03d0fde3a6a691a6d] in subnet: subnet-0318d75ae06a34052"}
aws-node-f5x7z aws-node {"level":"info","ts":"2024-02-13T13:19:12.910Z","caller":"awsutils/awsutils.go:728","msg":"Created a new ENI: eni-0a9379b23fe4ae3e1"}
aws-node-f5x7z aws-node {"level":"info","ts":"2024-02-13T13:19:14.216Z","caller":"ipamd/ipamd.go:838","msg":"Successfully created and attached a new ENI eni-0a9379b23fe4ae3e1 to instance"}

aws-node-f5x7z aws-node {"level":"info","ts":"2024-02-13T13:19:15.006Z","caller":"ipamd/ipamd.go:1097","msg":"Added ENI(eni-0a9379b23fe4ae3e1)'s IP/Prefix 10.110.130.178/32 to datastore"}
aws-node-f5x7z aws-node {"level":"info","ts":"2024-02-13T13:19:15.006Z","caller":"aws-k8s-agent/main.go:91","msg":"Serving RPC Handler version on 127.0.0.1:50051"}
aws-node-f5x7z aws-node {"level":"info","ts":"2024-02-13T13:19:15.006Z","caller":"runtime/asm_amd64.s:1650","msg":"Serving metrics on port 61678"}
aws-node-f5x7z aws-node {"level":"info","ts":"2024-02-13T13:19:15.006Z","caller":"ipamd/introspect.go:54","msg":"Serving introspection endpoints on 127.0.0.1:61679"}
aws-node-f5x7z aws-node {"level":"info","ts":"2024-02-13T13:19:15.006Z","caller":"runtime/asm_amd64.s:1650","msg":"Setting up shutdown hook."}
aws-node-f5x7z aws-node time="2024-02-13T13:19:15Z" level=info msg="Copying config file... "
aws-node-f5x7z aws-node time="2024-02-13T13:19:15Z" level=info msg="Successfully copied CNI plugin binary and config file."
aws-node-f5x7z aws-node {"level":"info","ts":"2024-02-13T13:19:16.286Z","caller":"rpc/rpc.pb.go:713","msg":"Received AddNetwork for NS /var/run/netns/cni-2b9d3041-7418-d61e-ac01-8fd27033c5c1, Sandbox cae024af5e5ae0dcb7c76f9496f018620d59c5671e33ed684e02040e6b40628d, ifname eth0"}

aws-node-f5x7z aws-node {"level":"info","ts":"2024-02-13T13:20:38.624Z","caller":"ipamd/ipamd.go:1097","msg":"Adding 10.110.140.234/32 to DS for eni-03928b37b5d4f3d56"}
aws-node-f5x7z aws-node {"level":"info","ts":"2024-02-13T13:20:38.624Z","caller":"ipamd/ipamd.go:1097","msg":"Added ENI(eni-03928b37b5d4f3d56)'s IP/Prefix 10.110.140.234/32 to datastore"}
aws-node-f5x7z aws-node {"level":"info","ts":"2024-02-13T13:20:43.953Z","caller":"datastore/data_store.go:714","msg":"assignPodIPAddressUnsafe: Assign IP 10.110.140.234 to sandbox aws-cni/f05f22e23f215ad08041ffd6c25663eeaca757df843f0571920e15714ce7f683/eth0"}
aws-node-f5x7z aws-node {"level":"info","ts":"2024-02-13T13:20:43.974Z","caller":"rpc/rpc.pb.go:713","msg":"Send AddNetworkReply: IPv4Addr 10.110.140.234, IPv6Addr: , DeviceNumber: 7, err: "}
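
To see which ENI a given pod IP ended up on, the "Added ENI(...)'s IP/Prefix ... to datastore" messages in the excerpt above can be indexed by IP. The following is a minimal sketch that assumes the log format shown here (a pod/container prefix followed by a JSON object); the function name `eni_by_ip` is hypothetical and not part of the VPC CNI.

```python
import json
import re

# Matches the ipamd message format seen in the log excerpt above.
ADDED_RE = re.compile(r"Added ENI\((eni-[0-9a-f]+)\)'s IP/Prefix (\S+) to datastore")


def eni_by_ip(log_lines):
    """Map each IP (or prefix base address) to the ENI ipamd added it to."""
    mapping = {}
    for line in log_lines:
        start = line.find("{")
        if start == -1:
            continue  # not a JSON log line, e.g. the time="..." level=info lines
        try:
            record = json.loads(line[start:])
        except json.JSONDecodeError:
            continue  # skip anything that is not valid JSON after the prefix
        match = ADDED_RE.search(record.get("msg", ""))
        if match:
            eni, cidr = match.groups()
            # Drop the mask so the key can be compared with a pod IP directly.
            mapping[cidr.split("/")[0]] = eni
    return mapping
```

Looking up the pod IP from the AddNetworkReply line in this mapping gives the ENI ID, which can then be checked against the ENI descriptions in the EC2 console.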

What you expected to happen:

Pod to be correctly assigned to an ENI with the correct security groups

How to reproduce it (as minimally and precisely as possible):

We set the following settings and in addition use calico

    --set env.MAX_ENI=${MAX_ENI_PER_WORKER:-3} \
    --set env.WARM_IP_TARGET=${WARM_IPS_PER_WORKER:-1} \
    --set env.MINIMUM_IP_TARGET=${MIN_IPS_PER_WORKER:-3} \
    --set env.WARM_ENI_TARGET=${WARM_ENI_PER_WORKER:-1} \
    --set eniConfig.region=${region} \
    --set image.region=${region} \
    --set init.image.region=${region} \
    --set nodeAgent.image.region=${region}

    priorityClassName: "system-node-critical"
    env:
      # see https://github.com/aws/amazon-vpc-cni-k8s/blob/7ab227ecbd14623456ea794e893696c2bd66f2b9/README.md
      AWS_VPC_K8S_CNI_CUSTOM_NETWORK_CFG: "true"
      AWS_VPC_K8S_CNI_EXTERNALSNAT: "true"
      ANNOTATE_POD_IP: "true"
      ENABLE_POD_ENI: "true"
      POD_SECURITY_GROUP_ENFORCING_MODE: "standard"

      AWS_VPC_K8S_CNI_LOGLEVEL: "INFO"
      AWS_VPC_K8S_PLUGIN_LOG_LEVEL: "INFO"

      AWS_VPC_K8S_CNI_LOG_FILE: "stdout"

    readinessProbe:
      initialDelaySeconds: 5

Anything else we need to know?:

Environment:

  • Kubernetes version (use kubectl version): - v1.27.9-eks-5e0fdde
  • CNI Version
  • OS (e.g: cat /etc/os-release):
    NAME="Amazon Linux"
    VERSION="2"
    ID="amzn"
    ID_LIKE="centos rhel fedora"
    VERSION_ID="2"
    PRETTY_NAME="Amazon Linux 2"
    ANSI_COLOR="0;33"
    CPE_NAME="cpe:2.3:o:amazon:amazon_linux:2"
    HOME_URL="https://amazonlinux.com/"
    SUPPORT_END="2025-06-30"
  • Kernel (e.g. uname -a): Linux ip-10-110-140-54.ec2.internal 5.10.205-195.807.amzn2.x86_64 #1 SMP Tue Jan 16 18:28:59 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux

@sstarcher it looks like you are using Security Groups for Pods with Custom Networking. In this case, it is the VPC Resource Controller which will handle ENI allocation and pod placement.

If the pod matches a Security Group policy, then it will be annotated by the VPC Resource Controller and placed behind the trunk ENI. If the pod does not match a Security Group policy, then it will be placed behind a regular ENI, which was allocated based on the Custom Networking spec.
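
The placement rule described above can be sketched as a small decision function. This is an illustration only, not the VPC Resource Controller's code: the `Pod` type, `matches`, and `placement` are hypothetical names, and a Security Group policy is reduced here to a plain label dictionary, which is a simplifying assumption.

```python
from dataclasses import dataclass, field


@dataclass
class Pod:
    name: str
    labels: dict = field(default_factory=dict)


def matches(selector, labels):
    """True when every key/value pair in the selector is present in the labels."""
    return all(labels.get(key) == value for key, value in selector.items())


def placement(pod, policy_selectors):
    """Return 'trunk' if any policy selector matches the pod, else 'regular'.

    'trunk' means the pod is annotated and placed behind the trunk ENI;
    'regular' means it uses an ENI allocated from the Custom Networking spec.
    """
    if any(matches(selector, pod.labels) for selector in policy_selectors):
        return "trunk"
    return "regular"
```

Under this model, a pod with no matching policy should always come out as "regular", which is why a pod without a Security Group policy landing on the trunk ENI is unexpected.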

Did you terminate your nodes after setting ENABLE_POD_ENI? This is a required step for Security Groups for Pods, as the controller needs to be able to build its internal state properly. I am not aware of any race conditions. If the pod matches a Security Group policy, it should match every time.

Also I see that you are using Calico. You are just using Calico for network policy, right? As we may need controller logs to debug this further, I suggest opening up an AWS support case.

Thanks I'll open a support ticket. All of the settings have been in place for months. The only change here is the chart version.

  • the pod I'm running does not have a security group policy
  • I expect it to get the node's ENIConfig
  • It does not get the node's ENIConfig
  • It gets the default security group and not what is listed in the ENIConfig (sometimes)

Is the VPC Resource Controller embedded somehow? We are using Security Groups for Pods and have been for a while, but only have this amazon-vpc-cni-k8s helm chart installed.

The VPC Resource Controller runs in the EKS-managed control plane. The major change between v1.12.6 and v1.16.0 is the VPC CNI using the CNINode CRD to communicate with the controller instead of the vpc.amazonaws.com/has-trunk-attached node label.

If no Security Group policy matches this pod, then it should not be annotated and placed behind the trunk ENI. When you describe the pod, do you see an annotation with vpc.amazonaws.com/pod-eni? If not, how do you know that it is behind the trunk ENI?

I'll have to recreate it to check the annotation. I found that it was using the trunk ENI because I took the pod IP and searched the interfaces. The working pods had an IP on a regular ENI, whereas the non-working pods had an IP on the trunk ENI, and I noticed the security groups there were wrong.

TRUNK Does Not work aws-k8s-trunk-eni eni-02eb9c28496b9765d - 5 IPs
ENI - Does work aws-K8S-i-03c758f943b4ca6d3 eni-03edf562fcac0125d - 14 IPs

^ those interfaces were both defined for the same node.
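
The manual check described above (take the pod IP, find the interface that holds it, read its description and security groups) can be sketched as follows. The function works on a dictionary shaped like the EC2 DescribeNetworkInterfaces response; `find_eni_for_ip` is a hypothetical helper name. Note that it only inspects secondary private IPs, so addresses handed out from delegated prefixes would need a separate lookup.

```python
def find_eni_for_ip(response, pod_ip):
    """Return id, description and security groups of the ENI holding pod_ip.

    `response` is expected to look like the output of EC2
    DescribeNetworkInterfaces. Returns None when no ENI lists the IP.
    """
    for eni in response.get("NetworkInterfaces", []):
        private_ips = [
            entry.get("PrivateIpAddress")
            for entry in eni.get("PrivateIpAddresses", [])
        ]
        if pod_ip in private_ips:
            return {
                "id": eni.get("NetworkInterfaceId"),
                "description": eni.get("Description"),
                "security_groups": [
                    group.get("GroupId") for group in eni.get("Groups", [])
                ],
            }
    return None
```

A result whose description is aws-k8s-trunk-eni for a pod without a Security Group policy would correspond to the non-working case reported here.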

We are facing a similar problem: our nodes are launched in public subnets, with ENIConfigs defining private subnets and Security Groups for pods. As per AWS EKS addon config:

AWS_VPC_K8S_CNI_CUSTOM_NETWORK_CFG = "true"
ENI_CONFIG_LABEL_DEF               = "topology.kubernetes.io/zone"
ENABLE_PREFIX_DELEGATION           = "true"
ENABLE_POD_ENI                     = "true"
POD_SECURITY_GROUP_ENFORCING_MODE  = "strict"
AWS_VPC_K8S_CNI_EXTERNALSNAT       = "true"

With this configuration, ipamd v1.15.5 works as expected - all ENIs except for the first one are placed in the subnets and get SG assigned as specified by ENIConfig.

In v1.16.2, the second ENI described as aws-k8s-trunk-eni is always created in the same subnet as the node itself and gets the node's SG attached (i.e. ENIConfig is ignored), but the third ENI named aws-K8S-${instance_id} is configured as expected. Since we have AWS_VPC_K8S_CNI_EXTERNALSNAT=true, pods allocated to the second interface (the one not configured as ENIConfig prescribes) are not SNATed and effectively don't have Internet access.

@stanvit when Custom Networking is configured, the primary ENI is unused. When Security Groups for Pods is configured, the trunk ENI, aws-k8s-trunk-eni, will always have the same Security Group as the primary ENI, as this is required for trunking to be setup properly. Subsequent ENIs will use the ENIConfig CRD, so they will have the Security Group defined in the CRD.

As an aside, when Security Groups for Pods and Custom Networking are configured, ENIs are attached by the VPC Resource Controller.

Regarding AWS_VPC_K8S_CNI_EXTERNALSNAT=true, configuring this means that pod traffic external to the VPC does not get SNAT'ed on the node, as the expectation is that you have configured VPC routes to force it through an egress gateway or NAT.

Everything you described sounds to me like it is working correctly, so I am wondering if you did not mean to configure AWS_VPC_K8S_CNI_EXTERNALSNAT=true? With that set to false, pod traffic from ENIs attached by ENIConfig destined to the Internet would SNAT through the node's primary IP.
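
The SNAT behaviour described above can be summarised in a few lines. This is a simplified model, not the CNI's iptables logic: `needs_node_snat` is a hypothetical name, and the VPC CIDR passed in the example is an assumption chosen to match the 10.110.x.x addresses seen in this thread.

```python
import ipaddress


def needs_node_snat(dest_ip, vpc_cidrs, external_snat):
    """Decide whether pod traffic to dest_ip is SNAT'ed on the node.

    With external SNAT enabled, the node never SNATs; an external NAT or
    egress gateway is expected to handle it. Otherwise only traffic leaving
    the VPC is SNAT'ed through the node's primary IP.
    """
    if external_snat:
        return False
    dest = ipaddress.ip_address(dest_ip)
    in_vpc = any(dest in ipaddress.ip_network(cidr) for cidr in vpc_cidrs)
    return not in_vpc


# Assumed VPC CIDR for illustration only.
VPC = ["10.110.0.0/16"]
print(needs_node_snat("8.8.8.8", VPC, external_snat=True))
print(needs_node_snat("8.8.8.8", VPC, external_snat=False))
```

With AWS_VPC_K8S_CNI_EXTERNALSNAT=true, a pod on an ENI in a subnet with no NAT route has no path to the Internet, which is consistent with the symptom reported for pods on the trunk interface.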

I have also verified that 1.15.5 also works for us where 1.16.0 does not. I will be testing 1.16.2 soon.

Hmm.. that is very strange. v1.16.0 did add IPv6 Security Groups for Pods support, but IPv4 should not have been affected. I see all tests passing without issue. Lmk what you find, and I think we will need controller logs, so we will definitely want to go through the support case.

@jdn5126, thanks for your answer

I ran a few tests on our cluster where we have both Prefix Delegation and Pod ENI enabled, and the problem boils down to this:

  • on small instances supporting only 2 ENIs (r7a.medium in our case), there's no way to have both trunk and PD interface at the same time, so prefixes are assigned directly to the trunk interface. When the instance is supporting 3+ ENIs, prefixes are never assigned to the trunk while subsequent interfaces are used for prefix delegation
  • in v1.15.5, ENIConfig is used for the trunk interface (both Subnet and SGs), but in v1.16.2, as you just described, trunk ENI is placed in the same subnet and gets the same Security Groups as the node

The changed behaviour is problematic for us, as pods that are not using dedicated ENIs are assigned different security groups and subnets depending on the instance type they are launched on. We have been using this configuration for over a year now; the issue was introduced in v1.16.0.

AWS_VPC_K8S_CNI_EXTERNALSNAT=true is intentional as our pods are running in separate private subnets routed through NAT Instances/Gateways

I collected logs for my four test cases (v1.15.5/r7a.medium, v1.15.5/r7i.large, v1.16.2/r7a.medium, v1.16.2/r7i.large) with aws-cni-support.sh, may provide them if there's interest

I have opened a support ticket waiting for it to be escalated.

@stanvit Custom Networking + Security Groups for Pods cannot work properly on instances that support only 2 ENIs, so this makes sense, but the prefixes being assigned to the trunk ENI part should not happen. Did you terminate the nodes after enabling prefix delegation?

I spun up a cluster using v1.15.5, and I do not see the behavior you described, i.e. the trunk ENI has the same security group as the primary ENI, so we are missing something here. Still digging...

@stanvit if you email the node logs to k8s-awscni-triage@amazon.com, we can take a look at them

@jdn5126 thanks for the email, I just sent all setup details and logs

the prefixes being assigned to the trunk ENI part should not happen

Thinking about this, you're right, but our setup worked like that up until recently.

Did you terminate the nodes after enabling prefix delegation?

I never disabled it, but yes, I was draining nodes and letting them be recreated after every vpc-cni version update

I spun up a cluster using v1.15.5, and I do not see the behavior you described, i.e. the trunk ENI has the same security group as the primary ENI, so we are missing something here

I sent my logs, so hopefully it sheds some light on the issue.

While we're at it: if prefix delegation on trunk interfaces is problematic, is it possible to prevent the trunk interface from being created on the instances with only two ENIs and custom networking enabled, or have some other way to disable trunking on certain nodes by, say, setting vpc.amazonaws.com/has-trunk-attached: false upon node creation?

We would like to keep using Custom networking for pods, Prefix Delegation, have the ability to use Security groups for pods occasionally, and use smaller instances where possible for cost savings.

@stanvit sorry for the delay, I will share my findings here:

r7a.medium - In v1.15.5 and v1.16.2, we are not properly skipping trunk ENIs when determining which ENIs we can allocate new IPs/prefixes to: https://github.com/aws/amazon-vpc-cni-k8s/blob/v1.15.5/pkg/ipamd/datastore/data_store.go#L1059 . Since we allocate prefixes on the trunk ENI and add them to the datastore, we start placing pods behind the trunk ENI, which seems like a bug. I think this is a general issue, but it manifests quickly when Security Groups for Pods and Custom Networking are configured and there are only two ENIs.
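
The intended behaviour, skipping trunk ENIs when choosing where to allocate new IPs or prefixes, can be illustrated with a short sketch. This is not the Go code from data_store.go; the `ENI` type, its fields, and `allocatable_enis` are hypothetical names used only to show the filter.

```python
from dataclasses import dataclass


@dataclass
class ENI:
    eni_id: str
    is_trunk: bool = False
    free_slots: int = 0  # IPs or prefixes that could still be added


def allocatable_enis(enis):
    """Return ENIs that may receive new IPs/prefixes for regular pods.

    Trunk ENIs are excluded even when they have free slots, so that no
    regular pod can end up behind the trunk interface.
    """
    return [eni for eni in enis if not eni.is_trunk and eni.free_slots > 0]
```

Without the `is_trunk` check, a node whose only non-primary ENI is the trunk would offer the trunk as the sole allocation target, matching what was observed on the two-ENI instance.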

On the Security Group front, looking at https://github.com/aws/amazon-vpc-cni-k8s/blob/v1.15.5/pkg/awsutils/awsutils.go#L528, we do not touch the Security Groups of ENIs when Custom Networking is enabled. This makes sense to me, as we are relying on the ENIConfig to control the SGs for attached ENIs, and we are relying on the VPC Resource Controller to control the SG for the trunk ENI: https://github.com/aws/amazon-vpc-resource-controller-k8s/blob/master/pkg/provider/branch/trunk/trunk.go#L221. The VPC Resource Controller also looks for the ENIConfig, so the only thing that would make sense to me here is that you did not terminate the node after enabling Custom Networking, hence the race condition on what SG was used for the trunk ENI.

Can you try terminating the nodes and try validating the SG on the trunk ENI afterward?

I see a similar story for r7i.large, so that leads me to the following conclusions:

  1. We need to fix https://github.com/aws/amazon-vpc-cni-k8s/blob/master/pkg/ipamd/datastore/data_store.go#L977 to properly skip trunk interfaces.
  2. When Custom Networking and Security Groups for Pods are configured, an instance with 2 ENIs can only support pods matching security group policies. Since no "normal" (not primary ENI or trunk ENI) ENIs can be attached, the instance has no IPs available for normal pods.
  3. Whenever Custom Networking and/or Security Groups for Pods is configured, the instances need to be terminated, otherwise there is a race condition on what Security Group will be assigned to the trunk ENI.
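
The arithmetic behind conclusion 2 can be written out explicitly. This is a back-of-the-envelope sketch based on the statements in this thread (with Custom Networking the primary ENI is unused for pods, and Security Groups for Pods reserves one ENI as the trunk); `enis_for_regular_pods` is a hypothetical name.

```python
def enis_for_regular_pods(max_enis, custom_networking, pod_eni):
    """Count the ENIs left over to host pods without a Security Group policy."""
    slots = max_enis
    if custom_networking:
        slots -= 1  # primary ENI is not used for pods
    if pod_eni:
        slots -= 1  # one ENI is taken by the trunk
    return max(slots, 0)


# Two-ENI instance (r7a.medium in this thread) with both features enabled.
print(enis_for_regular_pods(2, custom_networking=True, pod_eni=True))
# Instance with three ENIs and both features enabled.
print(enis_for_regular_pods(3, custom_networking=True, pod_eni=True))
```

On a two-ENI instance with both features enabled the result is zero, which is why regular pods there can only get an address if something is allocated on the trunk ENI.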

Internally, I am working with the EKS Networking team to determine what to do about number 1

@stanvit sorry, I forgot to address the other threads:

I never disabled it, but yes, I was draining nodes and letting them be recreated after every vpc-cni version update

Terminating is definitely a requirement, as draining will not detach the trunk ENI.

is it possible to prevent the trunk interface from being created on the instances with only two ENIs and custom networking enabled, or have some other way to disable trunking on certain nodes by, say, setting vpc.amazonaws.com/has-trunk-attached: false upon node creation?

It is possible, but we would need to track this as a new feature request. The request would get more visibility if added at https://github.com/aws/containers-roadmap/issues.

@jdn5126, thanks for your answers

The VPC Resource Controller also looks for the ENIConfig, so the only thing that would make sense to me here is that you did not terminate the node after enabling Custom Networking, hence the race condition on what SG was used for the trunk ENI.

I never disabled Custom Networking, if by enabling/disabling you mean changing the AWS_VPC_K8S_CNI_CUSTOM_NETWORK_CFG value and restarting the aws-node DaemonSet. It was always enabled during the testing.

Can you try terminating the nodes and try validating the SG on the trunk ENI afterward?

Done multiple times, and the behaviour is reproducible: with v1.15.5, the trunk interface is configured according to the Custom Networking settings, both SGs and subnets. I'll collect more data and send it tomorrow (aws describe instances output etc)

I think I might have just gotten onto something, though: I tried to label new nodes with vpc.amazonaws.com/has-trunk-attached: "true", hoping to trick the VPC Resource Controller into not attaching a trunk ENI. That didn't help, but instead made v1.15.5 behave like v1.16.2: the trunk interface still got created, though this time it ignored Custom Networking.

@stanvit ok, I think I finally have more of the story. So first, the desired behavior:

  • The VPC Resource Controller should create the trunk ENI with the Security Group specified in the ENIConfig

This happens in v1.15.5, but in v1.16.2, the Security Group for the trunk ENI may be the same as the one for the primary ENI.

What changed?

  • In v1.15.x+, the VPC CNI communicates with the VPC Resource Controller via the CNINode CRD (no longer uses node labels). In v1.15.5, the VPC CNI enabled features in the following order: Custom Networking then Security Groups for Pods. In v1.16.2, the order of enabling is reversed (#2639 is the PR that reversed it).
  • So the VPC Resource Controller is creating the trunk ENI before registering that it needs to use the ENIConfig. This shouldn't be a problem on new instances, and I have not been able to reproduce it, but I can see that it happened in your case.
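
The ordering problem described above can be shown with a toy model. This is a deliberately simplified, synchronous simulation, not the VPC Resource Controller's reconciliation logic: it assumes the trunk ENI is created the moment Security Groups for Pods is seen, and the feature names and `trunk_security_groups` function are hypothetical.

```python
def trunk_security_groups(enable_order, node_sgs, eniconfig_sgs):
    """Return the SGs the trunk ENI would get for a given enable order.

    The trunk ENI is created when 'security-groups-for-pods' is seen. It
    uses the ENIConfig SGs only if 'custom-networking' was registered first.
    """
    custom_networking_seen = False
    for feature in enable_order:
        if feature == "custom-networking":
            custom_networking_seen = True
        elif feature == "security-groups-for-pods":
            return eniconfig_sgs if custom_networking_seen else node_sgs
    return None  # Security Groups for Pods never enabled: no trunk ENI


# Order described for v1.15.5, then the reversed order described for v1.16.2.
print(trunk_security_groups(
    ["custom-networking", "security-groups-for-pods"], ["sg-node"], ["sg-eniconfig"]))
print(trunk_security_groups(
    ["security-groups-for-pods", "custom-networking"], ["sg-node"], ["sg-eniconfig"]))
```

In the real system the controller is asynchronous, so the reversed order produces a race rather than a guaranteed wrong result, which fits the reports that the problem appeared only sometimes.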

This seems like a general reconciliation issue with VPC Resource Controller, so I am engaging that team now.

For the other problem, where prefixes were assigned to the trunk ENI, #2801 should fix that. I spun up a cluster with Custom Networking and Security Groups for Pods and validated it.

@jdn5126, thanks for the update

  • Do you need any more experiments/details from my side?
  • So, as I understand it, after #2801 (Do not allocate IPs or prefixes to trunk ENIs or EFA ENIs) is merged, the prefixes won't be delegated to the trunk interfaces, and instances that are limited to 2 ENIs won't be able to use Custom Networking and Security Groups for Pods at the same time, preventing "normal" pods from launching?

I spoke to the VPC Resource Controller team and finally have the full story here. To fix the regression between v1.15.5 and v1.16.x, I am going to revert the order in which IPAMD enables features (Custom Networking before Security Groups for Pods). In parallel, the VPC Resource Controller team is going to explore options for reconciling and updating the Security Group for the trunk ENI, as today the trunk ENI Security Group cannot be changed after creation. So if you change the ENIConfig, the trunk Security Group will not be updated on existing nodes.

For the second part, instances with only 2 ENIs, I am discussing with our Project Manager whether we can mark these instances as invalid for trunk ENIs, so that only "normal" pods will be scheduled on them. If approved, we would treat this as an enhancement to an existing feature.

The VPC CNI fix will go in soon, and will target v1.16.4, which is scheduled to release in early to mid March. We do not need any more details from your end, as we are able to reproduce and understand the issue now. Thank you so much for your patience and help!

#2801 contains fixes for two of the issues mentioned here:

  • assigning prefixes to trunk ENIs
  • not creating the trunk ENI with the correct Security Group

I filed aws/amazon-vpc-resource-controller-k8s#373 to cover updating the trunk ENI Security Group when the ENIConfig object changes.

For instances that can support only two ENIs, we are still determining whether it is ok to mark these instances as not eligible for Security Groups for Pods when Custom Networking is enabled.

Closing this issue as the fix has merged and will ship in v1.16.4 early next week. #2818 also provides integration test coverage to prevent regressions.
