aws/amazon-vpc-cni-k8s

Custom Networking ENIs fail to attach in v1.13.0 when no security groups are specified

jdn5126 opened this issue · 4 comments

What happened:
v1.13.0 includes #2354, which improves custom networking startup time by attaching custom ENIs during node initialization. That change introduced a bug that appears when no security group is specified in the ENIConfig object. In that case, we should fall back to the security group assigned to the node's primary ENI.

However, the security group assigned to the node's primary ENI comes from the EC2 metadata cache, and that value is not synced until after node initialization: https://github.com/aws/amazon-vpc-cni-k8s/blob/v1.13.0/pkg/ipamd/ipamd.go#L450

Our retry strategy in v1.13.0 is to let the aws-node pod crash so that the node does not become ready for pods to be scheduled. When the aws-node pod restarts, it hits the same issue, because the cache does not sync the primary ENI's security group ID until after node initialization. The pod therefore crashes again on every restart.
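
To make the failure mode concrete, here is a minimal Go sketch of the fallback described above. It is not the plugin's actual code: `resolveSecurityGroups`, `errNoSecurityGroups`, and the `sg-...` IDs are hypothetical names chosen for illustration. It assumes the fallback simply prefers the ENIConfig's security groups and otherwise uses whatever is cached for the primary ENI.

```go
package main

import (
	"errors"
	"fmt"
)

// errNoSecurityGroups is a hypothetical sentinel error used only in this sketch.
var errNoSecurityGroups = errors.New("no security groups available for ENI creation")

// resolveSecurityGroups is a hypothetical helper, not a function from the plugin.
// It prefers the security groups from the ENIConfig and falls back to the
// groups cached for the node's primary ENI. If both are empty it returns an
// error instead of passing an empty list on to ENI creation.
func resolveSecurityGroups(customSGs, primaryENISGs []string) ([]string, error) {
	if len(customSGs) > 0 {
		return customSGs, nil
	}
	if len(primaryENISGs) > 0 {
		return primaryENISGs, nil
	}
	return nil, errNoSecurityGroups
}

func main() {
	// During node initialization the cached primary-ENI groups are still
	// empty, so the fallback has nothing to return.
	if _, err := resolveSecurityGroups(nil, nil); err != nil {
		fmt.Println("during init:", err)
	}

	// Once the metadata cache has synced, the same fallback succeeds.
	sgs, _ := resolveSecurityGroups(nil, []string{"sg-primary"})
	fmt.Println("after sync:", sgs)

	// An ENIConfig that names a security group never reaches the fallback.
	sgs, _ = resolveSecurityGroups([]string{"sg-custom"}, nil)
	fmt.Println("explicit group:", sgs)
}
```

The first call mirrors the situation in the logs below, where the list of security groups is still `[]` when the ENI is created. Returning an error at that point, rather than sending an empty list to EC2, is a design choice of this sketch and not necessarily how the fix was implemented.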

Attach logs

{"level":"info","ts":"2023-06-15T21:55:31.095Z","caller":"ipamd/ipamd.go:903","msg":"Found ENI Config Name: us-west-2b"}
{"level":"info","ts":"2023-06-15T21:55:31.196Z","caller":"ipamd/ipamd.go:873","msg":"ipamd: using custom network config: [], subnet-07f48a842d5d33cee"}
{"level":"info","ts":"2023-06-15T21:55:31.196Z","caller":"awsutils/awsutils.go:763","msg":"Using a custom network config for the new ENI"}
{"level":"warn","ts":"2023-06-15T21:55:31.196Z","caller":"awsutils/awsutils.go:763","msg":"No custom networking security group found, will use the node's primary ENI's SG: []"}
{"level":"info","ts":"2023-06-15T21:55:31.196Z","caller":"awsutils/awsutils.go:763","msg":"Creating ENI with security groups: [] in subnet: subnet-07f48a842d5d33cee"}
{"level":"error","ts":"2023-06-15T21:55:31.299Z","caller":"awsutils/awsutils.go:763","msg":"Failed to CreateNetworkInterface InvalidParameterValue: user [REDACTED] does not own a resource\n\tstatus code: 400, request id: e12e797e-19f5-47fa-8d8e-4a5de5a677b7"}

What you expected to happen:
Custom ENIs should be attached during node initialization, and there should be a better retry strategy.

How to reproduce it (as minimally and precisely as possible):

  1. Deploy VPC CNI v1.13.0
  2. Create an ENIConfig with no security groups
  3. Observe the aws-node pod crash with the CreateNetworkInterface error shown in the logs above

Anything else we need to know?:

Environment:

  • Kubernetes version (use kubectl version): does not matter
  • CNI Version: v1.13.0
  • OS (e.g: cat /etc/os-release): Amazon Linux 2
  • Kernel (e.g. uname -a): does not matter

The workaround for this issue in v1.13.0 is to specify any security group ID when defining an ENIConfig. It is recommended that you choose a no-op security group, like the cluster security group or VPC default security group ID.

cparik commented

+1
Observed the same issue. The workaround helps to get the nodes to join the cluster.

Closing, as this is fixed in the v1.13.2 release.
