aws/amazon-vpc-cni-k8s

Can't use aws-node without VPC Resource Controller?

mattburgess opened this issue · 12 comments

What happened:

I tried upgrading our aws-node DaemonSet from v1.12.6 to v1.15.0, but the pods fail to start. They log the following:

{"level":"info","ts":"2023-09-26T14:17:02.870Z","caller":"ipamd/ipamd.go:550","msg":"Get Node Info for: ip-10-50-97-227.eu-west-1.compute.internal"}
{"level":"error","ts":"2023-09-26T14:17:02.974Z","caller":"ipamd/ipamd.go:423","msg":"Failed to add feature custom networking into CNINode%!(EXTRA *fmt.wrapError=failed to get API group resources: unable to retrieve the complete list of server APIs: vpcresources.k8s.aws/v1alpha1: the server could not find the requested resource)"}
{"level":"error","ts":"2023-09-26T14:17:02.974Z","caller":"aws-k8s-agent/main.go:32","msg":"Initialization failure: failed to get API group resources: unable to retrieve the complete list of server APIs: vpcresources.k8s.aws/v1alpha1: the server could not find the requested resource"}

That seems to be due to #2503. I took a look at the various env vars but couldn't see anything there, or in the code, that makes this feature optional. Having a hard dependency on a controller that appears to be designed specifically for EKS suggests we can't upgrade to this release on a non-EKS-but-still-hosted-in-AWS cluster. Have I understood things correctly, or have I missed something in the docs?


Environment:

  • Kubernetes version (use kubectl version): 1.24.17
  • CNI Version: 1.15.0
  • OS (e.g: cat /etc/os-release): Ubuntu 20.04
  • Kernel (e.g. uname -a): 5.19.0-1029-aws

@mattburgess how did you upgrade from v1.12.6 to v1.15.0? Did you apply the full manifest or install via the Helm chart?

The VPC CNI has always had a dependency on the VPC Resource Controller for certain features, and a new CRD was introduced in v1.15.0. It sounds like that CRD is not installed in your cluster.

As per the release note at https://github.com/aws/amazon-vpc-cni-k8s/releases/tag/v1.15.0, I applied the full manifest from https://raw.githubusercontent.com/aws/amazon-vpc-cni-k8s/v1.15.0/config/master/aws-k8s-cni.yaml. Well, full disclosure: I applied all of that except for the aws-network-policy-agent container, as we have enable-network-policy-controller: "false" set in the ConfigMap, so I didn't think it was necessary to run it. I can see the policyendpoints.networking.k8s.aws CRD in our cluster but can't see anything related to vpcresources.k8s.aws.
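
For completeness, this is the setting I mean. The ConfigMap name below is the one shipped in the v1.15.0 manifest (amazon-vpc-cni in kube-system), so adjust if yours differs:

$ kubectl -n kube-system get configmap amazon-vpc-cni -o jsonpath='{.data.enable-network-policy-controller}'
false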

Just for clarity, these are the two CNI-related CRDs I have installed:

$ kubectl sandbox get crd | grep aws
eniconfigs.crd.k8s.amazonaws.com                     2023-09-25T01:24:32Z
policyendpoints.networking.k8s.aws                   2023-09-26T13:55:33Z

Got it. So it looks like the VPC CNI does have a hard dependency on the CNINode CRD that the VPC Resource Controller installs (https://github.com/aws/amazon-vpc-resource-controller-k8s/blob/master/config/crd/bases/vpcresources.k8s.aws_cninodes.yaml), because its Kubernetes client needs to load the scheme: https://github.com/aws/amazon-vpc-cni-k8s/blob/master/pkg/k8sapi/k8sutils.go#L115
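
You can see the missing API group directly with kubectl's raw API access. On a cluster where nothing serves vpcresources.k8s.aws, the discovery call should fail with the same NotFound error that IPAMD's client hits (the output below is what I'd expect, not captured from your cluster):

$ kubectl get --raw /apis/vpcresources.k8s.aws/v1alpha1
Error from server (NotFound): the server could not find the requested resource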

In EKS, the VPC Resource Controller installs this CRD, so your issue lines up. You could argue that the VPC CNI should also try to install this CRD to prevent this issue, as otherwise there is a hard dependency on the controller being present.

In the meantime, you can manually install the CRD to avoid this issue, since you do not depend on any VPC Resource Controller features.
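
For reference, applying the raw version of the CRD manifest linked above should be all that's needed (path inferred from the controller repo layout, so adjust if it moves):

$ kubectl apply -f https://raw.githubusercontent.com/aws/amazon-vpc-resource-controller-k8s/master/config/crd/bases/vpcresources.k8s.aws_cninodes.yaml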

@mattburgess I am discussing with the team internally how we should handle this, as we definitely need to support the VPC CNI running in Kubernetes without a dependency on the EKS control plane.

Thanks for the super quick turnaround on this @jdn5126. Does the following suggest I might still need the controller in place, though? This is after I've installed the CNINode CRD as you previously suggested:

{"level":"info","ts":"2023-09-27T08:39:32.365Z","caller":"ipamd/ipamd.go:550","msg":"Get Node Info for: ip-10-50-96-39.eu-west-1.compute.internal"}
E0927 08:39:32.471903      10 reflector.go:148] pkg/mod/k8s.io/client-go@v0.27.3/tools/cache/reflector.go:231: Failed to watch *v1alpha1.CNINode: unknown (get cninodes.vpcresources.k8s.aws)
{"level":"error","ts":"2023-09-27T08:39:32.570Z","caller":"ipamd/ipamd.go:423","msg":"Failed to add feature custom networking into CNINode%!(EXTRA *errors.StatusError=CNINode.vpcresources.k8s.aws \"ip-10-50-96-39.eu-west-1.compute.internal\" not found)"}
{"level":"error","ts":"2023-09-27T08:39:32.570Z","caller":"aws-k8s-agent/main.go:32","msg":"Initialization failure: CNINode.vpcresources.k8s.aws \"ip-10-50-96-39.eu-west-1.compute.internal\" not found"}

Ah sorry @mattburgess, I should have looked more closely at the error. You have custom networking configured, so you are failing at https://github.com/aws/amazon-vpc-cni-k8s/blob/master/pkg/ipamd/ipamd.go#L564: IPAMD is trying to patch the CNINode resource to let the controller know that custom networking is enabled, but the resource does not exist because it was never created by the controller.

The intent here is for the VPC CNI to be able to run without the controller, with advanced features only being possible with the controller, so the issue you are seeing is a bug that requires a code change. We only need to let the controller know that custom networking is enabled when Security Groups for Pods (a controller-only feature) is enabled. I can get this fix into v1.15.1, which is targeting mid-October.
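
To illustrate why installing the CRD alone isn't enough here: on EKS the controller creates a CNINode object for each node, so the patch target exists, while on a self-managed cluster the CRD can be present but the list stays empty, which is why the PATCH fails with the NotFound you saw (expected output, not captured from a live cluster):

$ kubectl get cninodes.vpcresources.k8s.aws
No resources found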

> Ah sorry @mattburgess, I should have looked more closely at the error.

Although they look similar, it's definitely a different error without the CRD in place than with it.

> We only need to let the controller know that custom networking is enabled when Security Groups for Pods (a controller-only feature) is enabled. I can get this fix into v1.15.1, which is targeting mid-October.

That's awesome! Thanks again.

Yep, the error is different, but it resolves to the same root cause: running a Kubernetes operation (GET, PATCH, WATCH) against a resource that either does not exist or does not have its CRD installed.

#2591 resolves this by making sure that we never patch the CNINode resource unless a controller feature is enabled. #2570 makes sure that we never issue a WATCH for CNINode.

Closing now that v1.15.1 has been released on GitHub.
