aws/amazon-vpc-cni-k8s

/run/xtables.lock created as directory when installed with Helm

kwohlfahrt opened this issue · 13 comments

What happened:

The file /run/xtables.lock is created as a directory on the host machine. This breaks things that expect it to be a file, including kube-proxy. With kube-proxy unavailable, the CNI cannot reach the API server, and the node remains stuck in the NotReady state.

Attach logs

Events from kube-proxy pod:

Events:
  Type     Reason       Age                   From     Message
  ----     ------       ----                  ----     -------
  Warning  FailedMount  4m2s (x60 over 109m)  kubelet  MountVolume.SetUp failed for volume "xtables-lock" : hostPath type check failed: /run/xtables.lock is not a file
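
For context, kube-proxy's manifest declares the lock volume roughly like this (reproduced from memory, so treat it as a sketch), which is why the type check above fails once a directory exists at that path:

```yaml
# kube-proxy's hostPath volume, approximately: FileOrCreate makes the kubelet
# create an empty file if the path is missing, and fail if it is not a file.
volumes:
  - name: xtables-lock
    hostPath:
      path: /run/xtables.lock
      type: FileOrCreate
```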

What you expected to happen:

I expect /run/xtables.lock to be created as a file.

How to reproduce it (as minimally and precisely as possible):

This is a race condition that occurs fairly rarely; it depends on whether kube-proxy or the AWS CNI DaemonSet is started first. It could probably be reproduced by:

  1. Installing the AWS CNI from the Helm chart
  2. Deleting /run/xtables.lock on the host
  3. Restarting the AWS CNI DaemonSet pod on the host

Anything else we need to know?:

I'll create a PR with a fix shortly.

Environment:

  • Kubernetes version (use kubectl version): 1.28.7
  • CNI Version: Helm Chart 1.13.0
  • OS (e.g: cat /etc/os-release): Ubuntu 22.04
  • Kernel (e.g. uname -a): 5.15.0-1055-aws

I expect /run/xtables.lock to be created as a file

What operating system or AMI are you on?

I have experienced this issue running the AL2023 1.26, 1.27, and 1.28 EKS-optimized AMIs. I worked around it by creating the file in the user-data script when the node boots. Obviously this isn't a great solution.
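
For anyone who wants the same stopgap, the idea is just to make sure a regular file exists at that path before the kubelet mounts it. A rough sketch, assuming plain cloud-init cloud-config user-data (not our exact script; on nodeadm-based AL2023 AMIs this would live in the cloud-config part of the multi-part user-data):

```yaml
#cloud-config
# Illustrative workaround only: pre-create the lock as a regular file at first
# boot so the kubelet's hostPath mount finds a file instead of creating a
# directory at /run/xtables.lock.
runcmd:
  - [ touch, /run/xtables.lock ]
```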

The kube-proxy manifest includes the FileOrCreate directive when defining the /run/xtables.lock volume. kube-proxy is usually launched before the aws-node pod. However, if aws-node launches before kube-proxy, aws-node seems to be creating /run/xtables.lock as a directory. I have seen some odd behavior as a result of this.

However, if aws-node launches before kube-proxy, aws-node seems to be creating /run/xtables.lock as a directory.

That's interesting. Thanks for adding this detail.

That's interesting. Thanks for adding this detail.

I think this may be the actual cause. I see the same behavior without the Helm chart, using the latest 1.16.4 release manifest. I am not getting any error logs in aws-node; however, pods that run on a node where aws-node launches before kube-proxy fail with TCP timeouts (unable to connect to resources outside the cluster, even with security groups and IAM configured properly).

Kubernetes version: 1.28
AMI: ami-0f4be968a0e634cd3 (AL2023, EKS 1.28)
Nodes are launched via Karpenter.

What operating system or AMI are you on?

We are using a custom AMI, based on Ubuntu 22.04.

However, if aws-node launches before kube-proxy, aws-node seems to be creating /run/xtables.lock as a directory.

Yes, exactly, though it's technically the kubelet (not the aws-node process) that creates the directory: aws-node declares the path as a hostPath volume without specifying a type, and if nothing already exists at that path, the kubelet creates a directory.
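
Concretely, the rendered volume looks something like this (an illustrative sketch rather than a verbatim copy of the chart output):

```yaml
# Sketch of the volume as rendered today: with no type set, the kubelet does
# no checking and creates a directory if nothing exists at the path yet.
volumes:
  - name: xtables-lock
    hostPath:
      path: /run/xtables.lock
      # type omitted -> a directory is created when the path is missing;
      # adding `type: FileOrCreate` makes the kubelet create a file instead
```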

Yes, exactly, though it's technically the kubelet (not the aws-node process) that creates the directory: aws-node declares the path as a hostPath volume without specifying a type, and if nothing already exists at that path, the kubelet creates a directory.

Absolutely. Poor word choice on my part.

Hi, we have seen the same issue on our Ubuntu 22.04 nodes. We are testing out setting FileOrCreate and will let you know if it resolves the problem.

We are testing out setting FileOrCreate and will let you know if it resolves the problem.

That will be super helpful.

I am not getting any error logs in aws-node; however, pods that run on a node where aws-node launches before kube-proxy fail with TCP timeouts (unable to connect to resources outside the cluster, even with security groups and IAM configured properly).

After updating to the March 7th EKS-optimized AL2023 AMI, the network issues have been resolved. We are still creating /run/xtables.lock manually in the user-data script on the nodes.

We are testing out setting FileOrCreate and will let you know if it resolves the problem.

That will be super helpful.

It seems to have resolved the issue.

This issue is now closed. Comments on closed issues are hard for our team to see.
If you need more assistance, please either tag a team member or open a new issue that references this one.

That's interesting. Thanks for adding this detail.

I think this may be the actual cause. I see the same behavior without the Helm chart, using the latest 1.16.4 release manifest. I am not getting any error logs in aws-node; however, pods that run on a node where aws-node launches before kube-proxy fail with TCP timeouts (unable to connect to resources outside the cluster, even with security groups and IAM configured properly).

Hey @Preston-PLB, have you actually reproduced the particular case you described above? We've hit something similar, but kube-proxy started before aws-node (the correct order), and all traffic from the faulty node was timing out: the kubelet reaching the Kube API, the kubelet hitting probes, metrics-server scraping other nodes. Everything was timing out, yet oddly workloads could still be scheduled on the node. I tried to reproduce according to the OP's steps and did indeed end up with MountVolume.SetUp failed for volume "xtables-lock", but that is different from what you (and I) described/experienced.

Hey @Preston-PLB, have you actually reproduced the particular case you described above? We've hit something similar, but kube-proxy started before aws-node (the correct order), and all traffic from the faulty node was timing out: the kubelet reaching the Kube API, the kubelet hitting probes, metrics-server scraping other nodes. Everything was timing out, yet oddly workloads could still be scheduled on the node. I tried to reproduce according to the OP's steps and did indeed end up with MountVolume.SetUp failed for volume "xtables-lock", but that is different from what you (and I) described/experienced.

@Kyslik Yes, I have been able to reproduce this, and you are correct: it has nothing to do with the launch order of kube-proxy and aws-node. When I was testing, I was running on a March version of the AL2023 EKS AMI. It turns out that on some AL2023 AMIs there is a race condition between the daemon provided by the AMI to configure ENIs and aws-node, and that is what leads to the network timeouts. I switched to AL2 before I learned this, and everything worked perfectly. I want to try the newer AL2023 AMIs but need the time and space to potentially break my dev environment.

In short, if you are running AL2023, update to the latest version of the AMI. If you can tolerate switching to AL2, that also works. If you are not on AL2023, I would look for any service on the node that attempts to configure ENIs, kill it, and see if that helps.