Multiple ENIs is confusing cloud-provider-aws controller
MadJlzz opened this issue · 9 comments
What happened:
I am working on deploying a Kubernetes cluster using Cluster API and `amazon-vpc-cni` as the network manager of the cluster. During my tests I observed a pretty strange behaviour of the `cloud-provider-aws` controller.
In fact, the Kubernetes node object's internal IP changed from the private IP of the EC2 ENI (the one provisioned alongside the creation of the instance) to the private IP of the ENI that was provisioned by the AWS VPC CNI controller. During my tests, I also saw that this behaviour was quite random.
This leads to a lot of problems, such as `kubectl` not being able to get back the results of commands like `kubectl logs` or `kubectl exec`, since `kube-apiserver` forwards those requests to the node hosting the pod using the internal IP fetched from the `Node` resource.
What I cannot explain, though, is why the secondary private IP attached to the same instance does not answer those calls properly even though the firewall allows any kind of traffic from any source.
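(For reference, the flip is easy to observe from the client side, since the INTERNAL-IP column comes straight from the `Node` object's first InternalIP address:)

```bash
# Watch the INTERNAL-IP column; it changes when the CCM re-syncs
# the node addresses and a secondary ENI's IP ends up listed first.
kubectl get nodes -o wide --watch
```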
I've implemented a workaround by fetching the primary IP of the node at runtime and passing it via the `--node-ip` flag to the kubelet before actually starting it. To make sure `cloud-provider-aws` doesn't override what I did, I've also set the `--allocate-node-cidrs=false` flag.
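A minimal sketch of that workaround, assuming IMDSv2 is enabled and kubelet is started through a wrapper script (the kubelet path and any flags other than `--node-ip` are illustrative):

```bash
#!/usr/bin/env bash
set -euo pipefail

# Fetch an IMDSv2 session token, then the instance's primary private IPv4.
# 'local-ipv4' always resolves to the primary address of the primary ENI,
# regardless of any ENIs the VPC CNI attaches later.
TOKEN=$(curl -sS -X PUT "http://169.254.169.254/latest/api/token" \
  -H "X-aws-ec2-metadata-token-ttl-seconds: 300")
NODE_IP=$(curl -sS -H "X-aws-ec2-metadata-token: ${TOKEN}" \
  "http://169.254.169.254/latest/meta-data/local-ipv4")

# Pin kubelet to the primary ENI's address.
exec /usr/bin/kubelet --node-ip="${NODE_IP}" "$@"
```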
What you expected to happen:
Once the `Node` object's internal IP is set, it should not be replaced by that of the other ENI. Alternatively, if using the other IP is supposed to work, then this becomes a networking problem for the CNI team.
Anything else we need to know?:
Here's a screenshot that shows the behaviour. The top pane shows the initial state, and the second pane shows the changed IPs after I deployed the `aws-vpc-cni` + `cloud-provider-aws` controllers.
Environment:
- Kubernetes version (use `kubectl version`):
Client Version: v1.29.3
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.29.0
- Cloud provider or hardware configuration: v1.29.1
- OS (e.g. from /etc/os-release):
NAME="Ubuntu"
VERSION="20.04.6 LTS (Focal Fossa)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 20.04.6 LTS"
VERSION_ID="20.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=focal
UBUNTU_CODENAME=focal
- Kernel (e.g. `uname -a`):
Linux ip-10-0-1-97 5.15.0-1056-aws #61~20.04.1-Ubuntu SMP Wed Mar 13 17:40:41 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
- Install tools: `helm` with values file:
```yaml
image:
  tag: v1.29.1
args:
  - --v=2
  - --allocate-node-cidrs=true
  - --cloud-provider=aws
  - --cluster-name="k993aws"
  - --cluster-cidr="10.0.0.0/16"
  - --configure-cloud-routes=false
```
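For completeness, the corresponding install command would look roughly like this, assuming the upstream aws-cloud-controller-manager chart is used (the release name is illustrative):

```bash
helm repo add aws-cloud-controller-manager https://kubernetes.github.io/cloud-provider-aws
helm upgrade --install aws-ccm \
  aws-cloud-controller-manager/aws-cloud-controller-manager \
  --namespace kube-system --values values.yaml
```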
/kind bug
This issue is currently awaiting triage.
If cloud-provider-aws contributors determine this is a relevant issue, they will accept it by applying the `triage/accepted` label and provide further guidance. The `triage/accepted` label can be added by org members by writing `/triage accepted` in a comment.
> Once the `Node` object's internal IP is set, it should not be replaced by that of the other ENI.
I don't think the IP is replaced. `kubectl` shows just one IP, but the `Node` object should have all of the IPs, as that is the default behavior, and they should be ordered based on the device (interface) number:
cloud-provider-aws/pkg/providers/v1/aws.go, lines 735 to 757 at cea2af6
The node controller will make sure that the instance's addresses always match the `Node` object's addresses: https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/cloud-provider/controllers/node/node_controller.go#L193-L197
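You can confirm this by dumping the full address list from the `Node` object rather than relying on what `kubectl get nodes` prints:

```bash
# Lists every address the cloud provider reported, in order.
# Multiple InternalIP entries are expected with multiple ENIs.
kubectl get node <node-name> \
  -o jsonpath='{range .status.addresses[*]}{.type}{"\t"}{.address}{"\n"}{end}'
```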
> What I cannot explain, though, is why the secondary private IP attached to the same instance does not answer those calls properly even though the firewall allows any kind of traffic from any source.
What is the error you are facing? Did you check the apiserver logs for the reason? It could also be a cert-verification failure.
Do you pass the `--node-ip` flag to `kubelet`?
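Regarding the cert-verification theory: if that is the cause, the kubelet serving certificate would be missing the secondary ENI's IP in its SANs. A quick way to check (a sketch; 10250 is the default kubelet serving port):

```bash
# Inspect the SANs of the kubelet serving certificate as presented
# on the secondary ENI's IP. If that IP is absent, TLS verification
# of proxied requests (logs/exec) to it will fail.
echo | openssl s_client -connect <secondary-eni-ip>:10250 2>/dev/null \
  | openssl x509 -noout -text | grep -A1 "Subject Alternative Name"
```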
> What is the error you are facing? Did you check the apiserver logs for the reason? It could also be a cert-verification failure.
It's been quite some time; I'll have to dig back into it to get extra details. I had problems getting back the results of commands like `kubectl logs` or `kubectl exec`, which are proxied by the `api-server` to the target node's `kubelet`.
> Do you pass the `--node-ip` flag to `kubelet`?
I had to do that as a workaround, yes. The IP I set is the primary IP of the initial network interface of the EC2 instance.
As soon as I have time, I'll try to gather more information and post it here.
There have been some recent discussions about `--node-ip` and how the external CCM should handle it. At this point, passing `--node-ip` to kubelet is the right thing to do, for AWS at least. Here's how we do it for the AL2-based EKS AMI: https://github.com/awslabs/amazon-eks-ami/blob/e50acfb7e6be088dde823dc80b21c50651e71b01/templates/al2/runtime/bootstrap.sh#L490-L495
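Once `--node-ip` is set, kubelet records the value in the `alpha.kubernetes.io/provided-node-ip` annotation on the `Node` (per current kubelet behavior), which makes it easy to verify the flag took effect:

```bash
# kubelet publishes the value of --node-ip in this annotation;
# the cloud provider uses it to select the matching InternalIP.
kubectl get node <node-name> -o yaml | grep provided-node-ip
# e.g. alpha.kubernetes.io/provided-node-ip: 10.0.1.97
```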
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed
You can:
- Mark this issue as fresh with `/remove-lifecycle stale`
- Close this issue with `/close`
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale