aws/amazon-vpc-cni-k8s

EKS upgrading from 1.24 to 1.25: VPC CNI issues

aguzmans opened this issue · 14 comments

What happened?
After upgrading EKS from v1.24 to 1.25, the cluster broke. It creates new nodes, but they are never ready.

The cluster has been upgraded a few times before with no issues, and as far as I remember we never touched or worked with the CNI when the cluster was created; that was years ago.
Now it is failing on something CNI-related after upgrading the control plane to 1.25. I am fine with or without the CNI, since the cluster is very small and it just worked until now. To try to fix it I deployed the CNI from the Helm charts, but that did not help. Here is the current situation.
Nodes:

NAME                                             STATUS     ROLES    AGE   VERSION
ip-10-1-16-66.region.compute.internal    Ready      <none>   28d   v1.24.16-eks-8ccc7ba
ip-10-1-17-106.region.compute.internal   Ready      <none>   17d   v1.24.16-eks-8ccc7ba
ip-10-1-19-94.region.compute.internal    NotReady   <none>   20h   v1.24.16-eks-8ccc7ba

The old nodes are still Ready because they predate the upgrade; the one that is 20h old is a new node created after the upgrade to 1.25.

kubectl get pods -n kube-system -l k8s-app=aws-node -o wide
NAME             READY   STATUS             RESTARTS        AGE   IP            NODE                                             NOMINATED NODE   READINESS GATES
aws-node-2bc8b   1/2     Running            0               72s   10.1.19.94    ip-10-1-19-94.region.compute.internal    <none>           <none>
aws-node-4ts2g   1/2     CrashLoopBackOff   476 (51s ago)   20h   10.1.16.66    ip-10-1-16-66.region.compute.internal    <none>           <none>
aws-node-z88fv   0/2     CrashLoopBackOff   476 (13s ago)   20h   10.1.17.106   ip-10-1-17-106.region.compute.internal   <none>           <none>

CNI support script:

[ec2-user@ip-10-1-19-94 ~]$ sudo bash /opt/cni/bin/aws-cni-support.sh
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100    56  100    56    0     0  19178      0 --:--:-- --:--:-- --:--:-- 28000

	This is version 0.7.6. New versions can be found at https://github.com/awslabs/amazon-eks-ami/blob/master/log-collector-script/

Trying to collect common operating system logs...
Trying to collect kernel logs...
Trying to collect modinfo... Trying to collect mount points and volume information...
Trying to collect SELinux status...
Trying to collect iptables information...
Trying to collect installed packages...
Trying to collect active system services...
Trying to Collect Containerd daemon information...
Trying to Collect Containerd running information...
Trying to Collect Docker daemon information...
Trying to collect kubelet information...
Trying to collect L-IPAMD introspection information... Trying to collect L-IPAMD prometheus metrics... Trying to collect L-IPAMD checkpoint... cp: cannot stat '/var/run/aws-node/ipam.json': No such file or directory

Trying to collect Multus logs if they exist...
Trying to collect sysctls information...
Trying to collect networking infomation... conntrack v1.4.4 (conntrack-tools): 206 flow entries have been shown.

Trying to collect CNI configuration information... cp: cannot stat '/etc/cni/net.d/*': No such file or directory

Trying to collect Docker daemon logs...
Trying to Collect sandbox-image daemon information...
Trying to Collect CPU Throttled Process Information...
Trying to Collect IO Throttled Process Information...
Trying to archive gathered information...

	Done... your bundled logs are located in /var/log/eks_i-09e960d71c207886a_2023-11-04_0111-UTC_0.7.6.tar.gz
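
The two "cannot stat" lines above are the interesting part: the ipamd checkpoint and the CNI configuration are both missing, which is consistent with aws-node never becoming ready, since the CNI config under /etc/cni/net.d/ is only written once ipamd initializes. A quick sanity check directly on a broken node would be something like:

# On a healthy node this directory contains 10-aws.conflist; here it should be empty
ls -l /etc/cni/net.d/

# ipamd writes its own log on the node; initialization errors usually show up here
sudo tail -n 100 /var/log/aws-routed-eni/ipamd.log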

A pod describe shows the following events:

Events:
  Type     Reason     Age   From               Message
  ----     ------     ----  ----               -------
  Normal   Scheduled  41s   default-scheduler  Successfully assigned kube-system/aws-node-2bc8b to ip-10-1-19-94.ap-northeast-2.compute.internal
  Normal   Pulled     40s   kubelet            Container image "ID-west-x.amazonaws.com/amazon-k8s-cni-init:v1.15.1" already present on machine
  Normal   Created    40s   kubelet            Created container aws-vpc-cni-init
  Normal   Started    40s   kubelet            Started container aws-vpc-cni-init
  Normal   Pulled     39s   kubelet            Container image "ID-west-x.amazonaws.com/amazon-k8s-cni:v1.15.1" already present on machine
  Normal   Created    39s   kubelet            Created container aws-node
  Normal   Started    39s   kubelet            Started container aws-node
  Normal   Pulled     39s   kubelet            Container image "ID-west-x.amazonaws.com/amazon/aws-network-policy-agent:v1.0.4" already present on machine
  Normal   Created    39s   kubelet            Created container aws-eks-nodeagent
  Normal   Started    38s   kubelet            Started container aws-eks-nodeagent
  Warning  Unhealthy  33s   kubelet            Readiness probe failed: {"level":"info","ts":"2023-11-04T01:05:39.919Z","caller":"/root/sdk/go1.20.8/src/runtime/proc.go:250","msg":"timeout: failed to connect service \":50051\" within 5s"}
  Warning  Unhealthy  27s   kubelet            Readiness probe failed: {"level":"info","ts":"2023-11-04T01:05:45.077Z","caller":"/root/sdk/go1.20.8/src/runtime/proc.go:250","msg":"timeout: failed to connect service \":50051\" within 5s"}
  Warning  Unhealthy  22s   kubelet            Readiness probe failed: {"level":"info","ts":"2023-11-04T01:05:50.212Z","caller":"/root/sdk/go1.20.8/src/runtime/proc.go:250","msg":"timeout: failed to connect service \":50051\" within 5s"}
  Warning  Unhealthy  15s   kubelet            Readiness probe failed: {"level":"info","ts":"2023-11-04T01:05:57.367Z","caller":"/root/sdk/go1.20.8/src/runtime/proc.go:250","msg":"timeout: failed to connect service \":50051\" within 5s"}
  Warning  Unhealthy  5s    kubelet            Readiness probe failed: {"level":"info","ts":"2023-11-04T01:06:07.375Z","caller":"/root/sdk/go1.20.8/src/runtime/proc.go:250","msg":"timeout: failed to connect service \":50051\" within 5s"}
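
Those readiness probe failures mean the probe cannot reach ipamd's gRPC endpoint on :50051, i.e. ipamd never finished initializing. A generic next step (not specific to this cluster) is to pull the container logs, including the previous attempt for the pods stuck in CrashLoopBackOff:

# Logs from the aws-node (ipamd) container of the pod on the new node
kubectl logs -n kube-system aws-node-2bc8b -c aws-node --tail=50

# For the crashing pods, the previous container's logs usually hold the real error
kubectl logs -n kube-system aws-node-4ts2g -c aws-node --previous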

This cluster was created using Terraform, and every previous upgrade just worked. Any ideas?

Further information:

oidc_provider=$(aws eks describe-cluster --name my-cluster --region MyRegion --query "cluster.identity.oidc.issuer" --output text | sed -e "s/^https:\/\///")

This gives me what seems to be a valid provider URL, whose trailing ID is also part of the server URL in my kubeconfig:

clusters:
- cluster:
    server: https://my-ID-CAPS-string...

That identity provider exists and matches the above, with the following audience:

Provider: oidc.eks.my-region.amazonaws.com/id/my-ID-CAPS-string.
(...)
Audience: sts.amazonaws.com
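
For reference, the check from the EKS IRSA docs that confirms an IAM OIDC provider with this ID actually exists looks roughly like this (same cluster/region placeholders as above):

oidc_id=$(aws eks describe-cluster --name my-cluster --region MyRegion --query "cluster.identity.oidc.issuer" --output text | cut -d '/' -f 5)
aws iam list-open-id-connect-providers | grep $oidc_id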

Then I have a role with the following policy:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "ec2:AssignPrivateIpAddresses",
                "ec2:AttachNetworkInterface",
                "ec2:CreateNetworkInterface",
                "ec2:DeleteNetworkInterface",
                "ec2:DescribeInstances",
                "ec2:DescribeTags",
                "ec2:DescribeNetworkInterfaces",
                "ec2:DescribeInstanceTypes",
                "ec2:DetachNetworkInterface",
                "ec2:ModifyNetworkInterfaceAttribute",
                "ec2:UnassignPrivateIpAddresses"
            ],
            "Resource": "*"
        },
        {
            "Effect": "Allow",
            "Action": [
                "ec2:CreateTags"
            ],
            "Resource": [
                "arn:aws:ec2:*:*:network-interface/*"
            ]
        }
    ]
}

Note that my cluster is IPv4 only. The role has the same policy and trust relationship as documented; I checked them with copy-paste and everything seems to add up there.
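
If it helps anyone doing the same check from the CLI, something along these lines works (the role name here is just the example name from the docs, not necessarily mine):

# Show the trust policy (should reference the OIDC provider and the aws-node service account)
aws iam get-role --role-name AmazonEKSVPCCNIRole --query 'Role.AssumeRolePolicyDocument'

# Show which managed/inline policies are attached
aws iam list-attached-role-policies --role-name AmazonEKSVPCCNIRole
aws iam list-role-policies --role-name AmazonEKSVPCCNIRole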

I also ran the service account creation/annotation as per the docs. What else should I do to troubleshoot this?

Finally, note that I have other EKS clusters where the CNI apparently works, yet a command like the following was never run on them (as per the docs mentioned above):

kubectl annotate serviceaccount \
    -n kube-system aws-node \
    eks.amazonaws.com/role-arn=arn:aws:iam::111122223333:role/AmazonEKSVPCCNIRole

So the check command on those other clusters gives empty output:

$ kubectl describe pod -n kube-system aws-node-l9bt5 | grep 'AWS_ROLE_ARN:\|AWS_WEB_IDENTITY_TOKEN_FILE:'
$

While in this cluster, that is not the case (the output is not empty).

@aguzmans did you update kube-proxy and coredns addons as well? The VPC CNI, kube-proxy, and coredns are required addons, and their versions need to be compatible with the Kubernetes version. You can find the recommended versions here:
kube-proxy: https://docs.aws.amazon.com/eks/latest/userguide/managing-kube-proxy.html
coredns: https://docs.aws.amazon.com/eks/latest/userguide/managing-coredns.html

My assumption is that this is your issue, but if not, then we can take a look at the node logs that you collected.
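
A rough sketch of checking and bumping those add-ons from the CLI, in case it is useful (cluster name and versions are placeholders):

# List the add-on versions published for Kubernetes 1.25
aws eks describe-addon-versions --kubernetes-version 1.25 --addon-name kube-proxy
aws eks describe-addon-versions --kubernetes-version 1.25 --addon-name coredns

# Update each managed add-on to a compatible version
aws eks update-addon --cluster-name my-cluster --addon-name kube-proxy --addon-version <version-from-above> --resolve-conflicts OVERWRITE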

Hi @jdn5126,
Thanks for your response. I will read the docs about kube-proxy and coredns that you shared. Aside from that, I tried a lot of things and nothing worked. For reasons I can't explain, in this old cluster (where we had not been using the EKS add-ons from the AWS console) kube-proxy, coredns, and aws-node all kept misbehaving after the upgrade. So I removed them and installed them via the console using the EKS add-ons feature.

That seems to have solved some of the issues.
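
For anyone who prefers the CLI over the console for that migration, the equivalent is roughly as follows; the --service-account-role-arn is optional and uses the example IRSA role name from the docs:

aws eks create-addon --cluster-name my-cluster --addon-name kube-proxy --resolve-conflicts OVERWRITE
aws eks create-addon --cluster-name my-cluster --addon-name coredns --resolve-conflicts OVERWRITE
aws eks create-addon --cluster-name my-cluster --addon-name vpc-cni \
    --service-account-role-arn arn:aws:iam::111122223333:role/AmazonEKSVPCCNIRole \
    --resolve-conflicts OVERWRITE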
Now I have some further issues to solve with the metrics server; I am getting this on virtually every command:

kubectl get no -o wide
E1108 09:29:17.676797   73105 memcache.go:287] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
E1108 09:29:17.822758   73105 memcache.go:121] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
E1108 09:29:17.900853   73105 memcache.go:121] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
E1108 09:29:17.975049   73105 memcache.go:121] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request

The metrics APIService seems to have an issue:

kubectl get apiservices

NAME                                   SERVICE                      AVAILABLE                  AGE
(...)
v1beta1.metrics.k8s.io                 kube-system/metrics-server   False (MissingEndpoints)   105d
...

Then also:

$ kubectl top no
Error from server (ServiceUnavailable): the server is currently unable to handle the request (get nodes.metrics.k8s.io)
$ kubectl top po
Error from server (ServiceUnavailable): the server is currently unable to handle the request (get pods.metrics.k8s.io)

and trying to fetch the metrics-server logs fails with:

Error from server: Get "https://10.1.18.52:10250/containerLogs/kube-system/metrics-server-7f8b7f9955-nppcx/metrics-server?follow=true": tls: failed to verify certificate: x509: certificate is valid for 127.0.0.1, not 10.1.18.52

My metrics server is installed from the Helm chart and I keep it pretty much vanilla.
I am not sure why this is happening, but I will try to figure it out today.
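
A couple of generic checks for these two symptoms (not specific to this cluster): MissingEndpoints on the APIService usually means no ready metrics-server pods are backing the kube-system/metrics-server service, and the certificate error suggests the kubelet serving certificate does not cover the node IP being connected to, which is why flags like --kubelet-preferred-address-types and --kubelet-insecure-tls come up so often in metrics-server issues:

# Is anything backing the metrics APIService?
kubectl -n kube-system get endpoints metrics-server
kubectl -n kube-system get pods -o wide | grep metrics-server
kubectl get apiservice v1beta1.metrics.k8s.io -o yaml

# Which kubelet-related flags is metrics-server currently running with?
# (deployment name assumed to be metrics-server)
kubectl -n kube-system get deploy metrics-server -o jsonpath='{.spec.template.spec.containers[0].args}'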

Hmm.. I am not very familiar with https://github.com/kubernetes-sigs/metrics-server, but given your symptoms, an AWS support case sounds like the best next step for help with the high-level debugging.

If you are bringing up a new 1.25 node with the latest kube-proxy, coredns, and VPC CNI add-ons installed, but the node is still not Ready, the next thing we would want to look at is the node logs that you collected. You can email them to k8s-awscni-triage@amazon.com if you want.

I think we're seeing the same issue when upgrading CNI plugin from 1.14 to 1.15 on EKS 1.25.
In our case, the aws-node logs indicate that some older v1alpha1 version of the amazon-vpc-resource-controller-k8s control plane API is missing:

{"level":"error","ts":"2023-11-08T19:19:53.374Z","caller":"ipamd/ipamd.go:423","msg":"Failed to add feature custom networking into CNINode%!(EXTRA *fmt.wrapError=failed to get API group resources: unable to retrieve the complete list of server APIs: vpcresources.k8s.aws/v1alpha1: the server could not find the requested resource)"}
{"level":"error","ts":"2023-11-08T19:19:53.374Z","caller":"aws-k8s-agent/main.go:32","msg":"Initialization failure: failed to get API group resources: unable to retrieve the complete list of server APIs: vpcresources.k8s.aws/v1alpha1: the server could not find the requested resource"}

In ipamd.go there's a reference to github.com/aws/amazon-vpc-resource-controller-k8s/apis/vpcresources/v1alpha1, but the only CRD in the 1.25 EKS cluster pertaining to vpcresources is already v1beta1.
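
A quick way to confirm which vpcresources API versions and CRDs the cluster actually serves:

kubectl api-versions | grep vpcresources
kubectl get crd | grep vpcresources.k8s.aws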

Do you see the cninodes.vpcresources.k8s.aws CRD in your cluster? Is it possible that you intentionally (or unintentionally) uninstalled it?

Are you configuring Security Groups for Pods? You mentioned CNI 1.15. Which version? The latest, aka v1.15.3?

No, that CRD is apparently not in our cluster. I'm looking at CNI 1.15.1 upgraded from 1.14.1; let me check 1.15.3.

Edit: We configured custom ENI networking; it works fine with 1.14.1.

That CRD is installed by a control plane component, the VPC Resource Controller. It is possible that you uninstalled it, but to dig in further we would need to look at the API server audit logs.

There's no chance we uninstalled anything when going from 1.24 to 1.25 EKS.

We updated the relevant Terraform resource and let it be applied, then drained all 1.24 nodes Karpenter created and watched them be replaced with 1.25 ones.

When we tried to install the (then latest) 1.15.1 CNI we saw this error happening in our dev environments, so we only upgraded to 1.14.1 in our stable environments.

Is it possible that simply upgrading to 1.25 EKS did not install said CRDs?

Also, if it's the VPC Resource Controller installing this, why would it install a v1beta1 version of the securitygrouppolicies API resources while the CNI plugin is referencing v1alpha1 types of the VPC Resource Controller?

Edit: Looking at the CRDs, I noticed that CNINode only exists as v1alpha1.
What's puzzling is that the v1beta1 CRD for SecurityGroupPolicies is in the same kustomization that installs the v1alpha1 CNINodes, so I don't see how one would be installed without the other.

@gnadaban the most likely scenario is that upgrading from 1.24 to 1.25 did not properly install the CNINode CRD. This CRD is needed when using Security Groups for Pods in VPC CNI v1.15.0+. If you are not configuring Security Groups for Pods, then you are likely hitting #2584 and can upgrade to v1.15.3 to avoid this error.

Regardless of that bug, though, the controller not installing the CNINode CRD in EKS 1.25 is a problem, and you will need to file a support case to have that investigated further. Please link to this GitHub issue in the support request. As for why the VPC Resource Controller uses v1beta1 for securitygrouppolicies and v1alpha1 for cninodes, let's ignore this part, as it is a red herring. The short answer is that these are the original versions and upgrading CRD versions can cause a multitude of different errors for customers, so we avoid it unless absolutely necessary.
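
For a Helm-managed install of the CNI, a sketch of that upgrade looks like this (the chart lives in the public eks-charts repo; the release name is an assumption, and the chart version should be pinned to one that ships v1.15.3):

helm repo add eks https://aws.github.io/eks-charts
helm repo update
helm upgrade --install aws-vpc-cni eks/aws-vpc-cni --namespace kube-system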

Thanks for your suggestion @jdn5126! Following your logic I was able to confirm that some upgraded clusters actually do have the CNINode CRD, which led me to find a cleanup script used on our regression-testing dev clusters that did not account for the newly introduced CRD.

In my case I had the CRD:

$ kubectl get crds
NAME                                         CREATED AT
cninodes.vpcresources.k8s.aws                2023-08-09T01:59:38Z
(...)

The problem with the metrics server was solved by draining all nodes, removing the Helm-installed metrics server, and installing it from the YAML manifest as recommended by AWS. It all seems to be working now.
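
For reference, "installing from YAML" here means applying the upstream components.yaml manifest after removing the chart release (the Helm release name is an assumption):

# Remove the Helm-managed release first
helm uninstall metrics-server -n kube-system

# Apply the upstream manifest
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml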

So to recap, all the changes that seem to have fixed the issues:
1- Removed the VPC CNI, kube-proxy, and CoreDNS that we maintained ourselves and installed the latest AWS EKS add-ons from the console.
2- Removed the metrics server and reinstalled it as recommended by AWS.
3- Drained all the nodes and created new nodes.
