aws/amazon-vpc-cni-k8s

CNI init detects the wrong interface when multiple interfaces are using the same MAC address

akunszt opened this issue · 8 comments

What happened:
We created a Linux bridge named br-pod on our EC2 instances and enslaved the eth0 device into it. To keep the DHCP configuration working, we had to set the same MAC address on the br-pod interface. This is completely legitimate; in fact, unless told otherwise, a bridge interface uses the smallest MAC address of all its enslaved interfaces.

The CNI init container fetches the ENI's MAC address from IMDS and searches for it locally. In our case it finds eth0 instead of br-pod, presumably because eth0 comes first in interface order. As a result, rp_filter is configured on the wrong interface and the generated configuration contains a wrong IP address for nodeIP.

It is hard to prepare for every possible network configuration, so we created a small patch to set the PrimaryIF via a PRIMARY_IF environment variable. We will open a PR, and I hope this approach (adding an optional environment variable to override the auto-detection algorithm) is acceptable upstream.
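To make the idea concrete, here is a minimal, self-contained Go sketch of the proposed behavior. It is not the actual patch: the function name getPrimaryIF and the use of the standard net package (instead of the project's netlink wrappers) are assumptions for illustration.

package main

import (
	"fmt"
	"net"
	"os"
	"strings"
)

// getPrimaryIF resolves the name of the primary interface. The proposed
// PRIMARY_IF environment variable short-circuits auto-detection; otherwise
// the first interface whose MAC matches primaryMAC wins, which is exactly
// how eth0 gets picked over br-pod when both share a MAC.
func getPrimaryIF(primaryMAC string) (string, error) {
	if override := os.Getenv("PRIMARY_IF"); override != "" {
		return override, nil
	}
	ifaces, err := net.Interfaces()
	if err != nil {
		return "", err
	}
	for _, iface := range ifaces {
		if strings.EqualFold(iface.HardwareAddr.String(), primaryMAC) {
			return iface.Name, nil
		}
	}
	return "", fmt.Errorf("no interface found with MAC %s", primaryMAC)
}

func main() {
	name, err := getPrimaryIF("06:70:b5:b4:e5:d7")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println("primaryIF:", name)
}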

The network setup looks like this on our side:

2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9001 qdisc mq master br-pod state UP group default qlen 1000
    link/ether 06:70:b5:b4:e5:d7 brd ff:ff:ff:ff:ff:ff
    altname enp0s5
    altname ens5
3: br-pod: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9001 qdisc noqueue state UP group default qlen 1000
    link/ether 06:70:b5:b4:e5:d7 brd ff:ff:ff:ff:ff:ff
    inet 10.32.73.123/20 metric 1024 brd 10.32.79.255 scope global dynamic br-pod
       valid_lft 2831sec preferred_lft 2831sec
    inet6 2a05:xxxx:xxxx:xxxx:54b7:d223:2b53:f121/128 scope global dynamic noprefixroute 
       valid_lft 447sec preferred_lft 137sec
    inet6 fe80::470:b5ff:feb4:e5d7/64 scope link 
       valid_lft forever preferred_lft forever

Attach logs

time="2023-11-28T14:45:05Z" level=info msg="Copying CNI plugin binaries ..."
time="2023-11-28T14:45:05Z" level=info msg="Copied all CNI plugin binaries to /host/opt/cni/bin"
time="2023-11-28T14:45:05Z" level=info msg="Found primaryMAC 06:70:b5:b4:e5:d7"
time="2023-11-28T14:45:05Z" level=info msg="Found primaryIF eth0"
time="2023-11-28T14:45:05Z" level=info msg="Updated net/ipv4/conf/eth0/rp_filter to 2\n"
time="2023-11-28T14:45:05Z" level=info msg="Updated net/ipv4/tcp_early_demux to 1\n"
time="2023-11-28T14:45:05Z" level=info msg="CNI init container done"

What you expected to happen:
The CNI init finds the proper interface, or at least lets us define the primary interface ourselves.

How to reproduce it (as minimally and precisely as possible):
Create a bridge interface with the primary interface enslaved into it.

We use this systemd-networkd configuration. You might want to modify it to match your environment.
The netdev for the bridge.

[NetDev]
Name=br-pod
Kind=bridge
MACAddress=06:70:b5:b4:e5:d7

eth0 is just one leg of the bridge.

[Match]
Name=eth0

[Network]
Bridge=br-pod

The bridge interface configuration.

[Match]
Name=br-pod

[Network]
Description=Bridge for pod networking
DHCP=true
IPv6AcceptRA=true
IPv6SendRA=false
KeepConfiguration=dhcp-on-stop

[DHCP]
UseMTU=true
UseDomains=true

Anything else we need to know?:
The networkctl output.

 IDX LINK           TYPE     OPERATIONAL SETUP     
   1 lo             loopback carrier     unmanaged
   2 eth0           ether    enslaved    configured
   3 br-pod         bridge   routable    configured

Just FYI, we are doing this to support dual-stack networking in our clusters so we can migrate our services to IPv6 (which will take a lot of time) and then move to a simple IPv6-only setup.

Environment:

  • Kubernetes version (use kubectl version):
Client Version: v1.28.2
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.27.2
  • CNI Version: 1.15.4
  • OS (e.g: cat /etc/os-release):
NAME="Flatcar Container Linux by Kinvolk"
ID=flatcar
ID_LIKE=coreos
VERSION=3510.2.6
VERSION_ID=3510.2.6
BUILD_ID=2023-08-07-1638
SYSEXT_LEVEL=1.0
PRETTY_NAME="Flatcar Container Linux by Kinvolk 3510.2.6 (Oklo)"
ANSI_COLOR="38;5;75"
HOME_URL="https://flatcar.org/"
BUG_REPORT_URL="https://issues.flatcar.org"
FLATCAR_BOARD="amd64-usr"
CPE_NAME="cpe:2.3:o:flatcar-linux:flatcar_linux:3510.2.6:*:*:*:*:*:*:*"
  • Kernel (e.g. uname -a):
Linux ip-10-32-73-123 5.15.122-flatcar #1 SMP Mon Aug 7 16:02:38 -00 2023 x86_64 AMD EPYC 7R32 AuthenticAMD GNU/Linux

@akunszt in general, we only want the VPC CNI to modify interfaces created by EC2 or by the VPC CNI itself. The bridge interface having the same MAC as the primary ENI may lead to other issues down the road; IMDS mostly comes to mind.

I think we should instead discuss the migration of services from IPv4 to IPv6. Whether those services are deployed in a new cluster or remain in the same cluster, you can place a dual-stack load balancer in front of them so that they are reachable via IPv4 and IPv6 endpoints.

From an IPv4 cluster, pods can reach IPv6-only endpoints when ENABLE_V6_EGRESS is set to true. Similarly, in an IPv6 cluster, pods can reach IPv4-only endpoints when ENABLE_V4_EGRESS is set to true.

These are the mechanisms by which other customers have performed IPv4 to IPv6 migrations, and we have end-to-end tests covering the process. If you have any questions, we can set up a longer discussion through a support case.

@jdn5126 Thank you, I am aware of those possibilities; we are an AWS Enterprise customer and have had several meetings about this.
None of those solutions works in our case. We cannot simply switch to IPv6, as it would break many of our services, and we need a lot of time (months at least) to test them. The business incentive is not very tempting either, since we would end up with the same functionality as before while burning hundreds of engineering hours. I tried to convince management, but they weren't enthusiastic.
We rely heavily on pod networking; using external load balancers instead of internal Kubernetes Services would bankrupt us very quickly. It is not an option.
ENABLE_V4_EGRESS is also not an option, as we cannot run everything on IPv6 yet. See vectordotdev/vector#19042 for an example.
ENABLE_V6_EGRESS won't help us either, as we still could not test IPv6 compatibility at all.
A new, IPv6-only cluster is also not an option; the inter-cluster network traffic cost would be very high.
What possible issues are you referring to? My PR modifies only the init process, so it will set the proper /proc entry and fetch the real IP of the ENI. I did not change anything in the CNI plugin itself.
Also, since AWS supports EC2 instances having ENIs in different VPCs, there is a non-zero chance that an instance will have two or more interfaces with the same MAC address.

I see. I am not sure I fully follow the IPv6 blockers, but as far as this issue goes, a lot of IMDS operations happen based on MACs, so I am concerned about the coverage and the implications here. We do not claim support for creating bridge interfaces, and that is not a path we want to go down upstream. When creating the bridge interface, can't you just set rp_filter there yourself?
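For reference, setting it manually amounts to writing the sysctl yourself; a minimal Go sketch, assuming the bridge name br-pod from above and mirroring what the init container does for the interface it detects:

package main

import "os"

func main() {
	// Equivalent to: sysctl -w net.ipv4.conf.br-pod.rp_filter=2
	// (loose reverse-path filtering on the assumed br-pod bridge).
	err := os.WriteFile("/proc/sys/net/ipv4/conf/br-pod/rp_filter", []byte("2"), 0644)
	if err != nil {
		panic(err)
	}
}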

I can set rp_filter there, but then nodeIP will be empty in the CNI configuration. I am not sure whether that causes any issues.
The IMDS operations won't be an issue. The instance uses the bridge interface with the same MAC as eth0 and with the same IP configuration provided by DHCP, so from the outside you cannot tell the difference at all.
As I said above, the bridge interface is just our thing, but you can have many interfaces with the same MAC address on one machine. With the new multi-VPC EC2 instance setup there is a non-zero chance that someone will hit this unintentionally (not a huge chance, but not zero either). The interface naming is usually static in an AMI, so the first interface will always have the same name, whether that is eth0, ens5, or something else. Not in every AMI, but it was the case in the ones we have used so far.
I am not asking you to support any bridge setup; I am only asking to be able to define the primary interface manually when you know what you are doing. After that, you are on your own.
I assume the primary interface matters only in the init, right? I see references only in the cmd/aws-vpc-cni-init/main.go file, so this change does not affect how the plugin works; it just handles a new environment variable in the init container.
(If you want to talk about the IPv6 migration bumps we have hit so far and exchange thoughts and experiences, I am open to it, but I don't think this is the right place for that.)

@jdn5126 I did some extra digging. It looks like there is another "who is the primary interface" function in the pkg/networkutils/network.go file. It is a bit strange, as it is used in the updateHostIptablesRules function but not in the SetupHostNetwork function. The latter uses a hard-coded eth0, and based on its name and comment it should run on the host, not in the containers, so it will fail if the OS uses a different name than eth0 for the primary interface (for example ens5), right?
I created a patch for pkg/networkutils/network.go and will test it in my environment.
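To illustrate the concern with a hedged sketch (illustrative only, not the code from network.go), a lookup by a fixed name fails on hosts whose primary interface is named ens5, while a MAC-based lookup, like the detection used elsewhere in the project, finds the link regardless of naming:

package main

import (
	"fmt"

	"github.com/vishvananda/netlink"
)

func main() {
	// Illustrative only: a lookup by a fixed name breaks on hosts where
	// the primary interface is not called eth0.
	if _, err := netlink.LinkByName("eth0"); err != nil {
		fmt.Println("hard-coded lookup failed:", err)
	}

	// A MAC-based lookup finds the link regardless of how the OS named
	// it (MAC value taken from the report above). Note it still returns
	// the first match when two links share a MAC.
	links, err := netlink.LinkList()
	if err != nil {
		panic(err)
	}
	for _, link := range links {
		if link.Attrs().HardwareAddr.String() == "06:70:b5:b4:e5:d7" {
			fmt.Println("found primary link:", link.Attrs().Name)
			break
		}
	}
}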

nodeIP is used only by the chained egress-cni plugin (https://github.com/aws/amazon-vpc-cni-k8s/blob/master/misc/10-aws.conflist#L21), which you would be disabling for this use case by setting ENABLE_V[4,6]_EGRESS to false.

For the comment about multi-VPC EC2 instances, that is not something that EKS supports today, but conflicting MACs may be a problem we have to solve in the future, yes.

For the network functions, those are both required for SNAT/connection tracking through the primary ENI. Cases like these are why I worry about the implications of allowing the primary interface to be changed. EKS explicitly sets up the primary ENI, and the VPC CNI installs ip rules and routes based on that contract.

Closing this issue, as this is not something we plan to implement. Feel free to open a request at https://github.com/aws/containers-roadmap/issues if you would like to push for this functionality.

This issue is now closed. Comments on closed issues are hard for our team to see.
If you need more assistance, please either tag a team member or open a new issue that references this one.