aws/amazon-vpc-cni-k8s

DataStore has no available IP/Prefix addresses

doryer opened this issue · 14 comments

What happened:
There are error logs that we see on both old and new versions of the CNI, indicating that the DataStore has no available IP/Prefix addresses. Looking at the nodes, they seem to be fairly empty (around 10 Pods, for example), which should not cause any trouble assigning addresses, and we see the error quite widely across all aws-node pods.

Even after reading the source code, we are not sure whether this affects CNI performance or causes issues allocating IPv4 addresses to pods.

Attach logs
ERROR logs:
{"level":"error","ts":"2023-11-16T10:05:03.920Z","caller":"datastore/data_store.go:682","msg":"DataStore has no available IP/Prefix addresses"}
WARN logs emitted around the time of the ERROR logs:
{"level":"warn","ts":"2023-11-16T10:06:18.306Z","caller":"ipamd/rpc_handler.go:230","msg":"UnassignPodIPAddress: Failed to find sandbox _migrated-from-cri/83faffebdfb826372e05d2bc67b6a3e8c40dd4b07dd034fafad9f3838aafa985/unknown"}

Also, even after reading the relevant code, we are not sure why we see the _migrated-from-cri and unknown fields.

What you expected to happen:

aws-node pods should assign IPv4 addresses properly, without errors.

How to reproduce it (as minimally and precisely as possible):

We are not really sure how to reproduce it either.

Environment:

  • Kubernetes version (use kubectl version): v1.25.15
  • CNI Version: v1.15.1
  • OS (e.g: cat /etc/os-release): Ubuntu 20.04.6 LTS
  • Kernel (e.g. uname -a): 5.15.0-1037-aws

@doryer those logs are not out of the ordinary if the node is scaling up and attaching new ENIs. Are you able to schedule pods on this node? Does the subnet that this node was deployed in have available IPv4 addresses?

Essentially, the IPAM daemon attaches ENIs to the node as demand and configuration (minimum/warm targets) requires. If no IPs are available then either:

  1. The instance has reached its ENI limit (no more ENIs can be attached) and all IPs assigned to the node are in use.
  2. The subnet is out of available IPv4 addresses, so no new ENIs can be attached or IPs assigned.
  3. There is a configuration error preventing ENIs from being attached.
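
To narrow down which of these applies, a few quick checks along these lines usually help (a sketch, assuming kubectl and AWS CLI access; the instance ID and subnet ID below are placeholders):

# Warm/minimum target configuration on the aws-node DaemonSet
kubectl -n kube-system describe daemonset aws-node | grep -E 'WARM_IP_TARGET|WARM_ENI_TARGET|WARM_PREFIX_TARGET|MINIMUM_IP_TARGET'

# ENIs currently attached to the instance, to compare against the instance type's ENI limit
aws ec2 describe-instances --instance-ids i-0123456789abcdef0 \
  --query 'Reservations[0].Instances[0].NetworkInterfaces[].NetworkInterfaceId'

# Free IPv4 addresses remaining in the node's subnet
aws ec2 describe-subnets --subnet-ids subnet-0123456789abcdef0 \
  --query 'Subnets[0].AvailableIpAddressCount'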

Hey @jdn5126 thanks for the quick reply.
So I picked an aws-node pod that outputs those errors. It runs on a node with 10 Pods on it (including DaemonSet pods, which use hostNetwork), the node is of type r5.4xlarge, and it has 1 ENI attached. The node has been logging these errors ever since it joined the cluster, and pods are still getting scheduled on it. So:

  1. The instance type has an ENI limit of 8; we are at 1 attached.
  2. The subnet has more than 10K available IP addresses; our cluster runs in 1 subnet in 1 AZ.
  3. Can you suggest which configuration? I'm pretty lost on how to tie this error to a configuration issue.

Thanks!

Can you collect the node logs from this node and email them to k8s-awscni-triage@amazon.com? I can dig into this further.

To collect the node logs, you can run /opt/cni/bin/aws-cni-support.sh with admin permissions on the node and then email the generated tarball.
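
For example (a sketch; run it on the affected node, and the script prints where it writes the tarball):

sudo bash /opt/cni/bin/aws-cni-support.sh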

No problem, sent it to you

@doryer the IP pool cannot be increased on each node because IMDS calls are getting 404s:

2023-11-21T11:00:08.555293408Z stdout F {"level":"warn","ts":"2023-11-21T11:00:08.555Z","caller":"retry/retry.go:70","msg":"Failed to increase pool, error trying to discover attached ENIs on attempt 1/12: get attached ENIs: failed to retrieve ENI metadata for ENI: 0a:f2:4d:c6:fc:4f: EC2MetadataError: failed to make EC2Metadata request\n<?xml version=\"1.0\" encoding=\"iso-8859-1\"?>\n<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Transitional//EN\"\n\t\t \"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd\">\n<html xmlns=\"http://www.w3.org/1999/xhtml\" xml:lang=\"en\" lang=\"en\">\n <head>\n  <title>404 - Not Found</title>\n </head>\n <body>\n  <h1>404 - Not Found</h1>\n </body>\n</html>\n\n\tstatus code: 404, request id:  "}

Can you confirm that nodes are able to curl IMDS? You can run this from a node (ref: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html):

TOKEN=`curl -X PUT "http://169.254.169.254/latest/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 21600"` \
&& curl -H "X-aws-ec2-metadata-token: $TOKEN" -v http://169.254.169.254/latest/meta-data/
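
If that succeeds, it may also be worth querying the per-ENI metadata path that ipamd uses, with the MAC address from the error above (a sketch that reuses the same $TOKEN):

curl -H "X-aws-ec2-metadata-token: $TOKEN" \
  http://169.254.169.254/latest/meta-data/network/interfaces/macs/0a:f2:4d:c6:fc:4f/interface-id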

I am not sure why nodes would not be able to reach IMDS unless you are explicitly blocking it with some network policy. I also see certificate errors in the node logs trying to curl the API server.

As a side note, what do you have AWS_VPC_K8S_PLUGIN_LOG_FILE set to? I see that AWS_VPC_K8S_CNI_LOG_FILE is set to stderr, which I would not recommend as then you lose historical logs. The lifetime of the logs is the same as the lifetime of the pod.
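
For reference, switching back to file-based logging would look roughly like this (a sketch against the standard aws-node DaemonSet; the paths are the documented defaults, and changing the env vars restarts the aws-node pods):

kubectl -n kube-system set env daemonset/aws-node \
  AWS_VPC_K8S_CNI_LOG_FILE=/host/var/log/aws-routed-eni/ipamd.log \
  AWS_VPC_K8S_PLUGIN_LOG_FILE=/var/log/aws-routed-eni/plugin.log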

Hey, so about IMDS: I ran the command on the nodes and it does work; otherwise we would have lots of other issues, so I'm not sure why this node suffers from the problem. We also only see it on some specific nodes, while the rest of the nodes in the cluster are running fine.

Regarding IMDS, we do see issues on other nodes where we get wrong results from IMDS on the machine: it reports network interfaces that no longer exist in AWS.
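
Roughly, the mismatch shows up when comparing the two sources like this (a sketch; the instance ID is a placeholder):

# MACs that IMDS reports on the instance
TOKEN=`curl -s -X PUT "http://169.254.169.254/latest/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 21600"` \
&& curl -s -H "X-aws-ec2-metadata-token: $TOKEN" http://169.254.169.254/latest/meta-data/network/interfaces/macs/

# ENIs that EC2 reports as attached to the same instance
aws ec2 describe-network-interfaces \
  --filters Name=attachment.instance-id,Values=i-0123456789abcdef0 \
  --query 'NetworkInterfaces[].MacAddress'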

About log files, AWS_VPC_K8S_CNI_LOG_FILE is set to stdout and AWS_VPC_K8S_PLUGIN_LOG_FILE is set to stderr because that fits our log shippers.

@doryer I have seen IMDS out-of-sync issues before, but the VPC CNI periodically syncs with IMDS, so they should resolve within a short amount of time. If the issue persists, that points to an instance-level issue, and we will need help from the EC2 team to investigate. Can you file a support ticket from your AWS console? I think that's the best next step to get resolution

So we had a few nodes where IMDS was not synced (I don't know how that is possible either), and we terminated them.
But we keep seeing those errors in the logs. I SSHed into the machine, ran the IMDS metadata command, and got a valid response with the right ENIs attached to the machine, so I'm not sure where the out-of-sync problem you see in the logs comes from. Any idea what else to check?

BTW, another thing I see is an increase in the awscni_aws_api_error_count metric for the API DeleteNetworkInterface with the error InvalidParameterValue (labels from the metric).
But looking at the logs, I only see warning logs similar to the ones I attached in the original message:
{"level":"warn","ts":"2023-11-29T12:25:09.528Z","caller":"ipamd/rpc_handler.go:230","msg":"UnassignPodIPAddress: Failed to find sandbox _migrated-from-cri/8d7a0a5acc2dd0c231fa0eba6a9c9530ea4e626350ac4921410842140f911ee4/unknown"}

Hmm... nothing immediately comes to mind. Can you open an AWS support case for this issue? That will allow us to dig deeper into the instances themselves. More progress can be made through that channel than on this thread.

This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 14 days

/not stale

/remove-lifecycle stale

Closing this issue, as its resolution will need to go through the support case

This issue is now closed. Comments on closed issues are hard for our team to see.
If you need more assistance, please either tag a team member or open a new issue that references this one.