aws/amazon-vpc-cni-k8s

IPAMD RPC connection refused - DelNetwork fails

alam0rt opened this issue · 5 comments

What happened:

A pod was stuck terminating for over a day. Looking at the containerd logs revealed that the CNI plugin was failing when asked to tear the sandbox down.

We performed a rollout of the CNI yesterday, which I believe is related (we didn't change the version, only made some termination grace period changes to the daemonset).

crictl showed the container as exited:

    "state": "CONTAINER_EXITED",
    "createdAt": "2023-08-01T11:30:02.598164927Z",
    "startedAt": "2023-08-01T11:30:02.712140282Z",
    "finishedAt": "2023-08-03T22:05:12.396785268Z",

Time of check: Thu Aug 3 23:34:01 UTC 2023

The routed-eni plugin logs were littered with:

{"level":"error","ts":"2023-08-03T23:36:38.052Z","caller":"routed-eni-cni-plugin/cni.go:287","msg":"Error received from DelNetwork gRPC call for container b9ed430757d7af66cff15e271917c87efdef372a3fb06118fa386f2f47f5c23e: rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial tcp 127.0.0.1:50051: connect: connection refused\""}

containerd logged:

{"error":"failed to destroy network for sandbox \"6b9344223bb198613612282a657a0222dad4c46f78377207e32d50f8ed692bdb\": plugin type=\"aws-cni\" name=\"aws-cni\" failed (delete): del cmd: error received from DelNetwork gRPC call: r...


What you expected to happen:

Pods terminate.

How to reproduce it (as minimally and precisely as possible):

Just a guess:

  1. Deploy v1.12.6
  2. Start some pods
  3. Rollout daemonset and restart CNI pods
  4. Try terminating pods (the issue only seems to affect some pods though)

Anything else we need to know?:

Might be related to containerd/containerd#8847.

Environment:

  • Kubernetes version (use kubectl version): v1.25.11
  • CNI Version: v1.12.6
  • OS (e.g: cat /etc/os-release): Ubuntu 20.04.6 LTS
  • Kernel (e.g. uname -a): 5.15.0-1039-aws

Maybe related: #2350

@alam0rt this is likely related to #2350. Before the fix for that issue, if the CNI plugin could not reach IPAMD, it would fail the pod deletion and rely on kubelet to keep retrying the delete until IPAMD was reachable again. If the node is under heavy CPU/memory contention and the aws-node pod cannot be scheduled (due to its resource requests/limits), then IPAMD never comes up.

Did you check whether the aws-node pod was running during this time? As for the fix, it is in v1.13.0+, but I recommend upgrading to v1.13.3+, since v1.12.6 through v1.13.2 have increased memory usage that is fixed in v1.13.3.
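
If it helps for next time, here is a rough client-go sketch for checking which aws-node pod is scheduled on a given node and what phase it is in. The kubeconfig path and node name are placeholders, and the k8s-app=aws-node label selector is an assumption based on the standard aws-node DaemonSet manifest:

    package main

    import (
        "context"
        "fmt"

        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/client-go/kubernetes"
        "k8s.io/client-go/tools/clientcmd"
    )

    func main() {
        // Load the default kubeconfig; in-cluster config works the same way.
        config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
        if err != nil {
            panic(err)
        }
        clientset, err := kubernetes.NewForConfig(config)
        if err != nil {
            panic(err)
        }

        // Hypothetical node name; replace with the node that had the stuck pod.
        nodeName := "ip-10-0-0-1.ec2.internal"

        pods, err := clientset.CoreV1().Pods("kube-system").List(context.TODO(), metav1.ListOptions{
            LabelSelector: "k8s-app=aws-node", // assumed DaemonSet pod label
            FieldSelector: "spec.nodeName=" + nodeName,
        })
        if err != nil {
            panic(err)
        }
        for _, p := range pods.Items {
            fmt.Printf("%s phase=%s started=%v\n", p.Name, p.Status.Phase, p.Status.StartTime)
        }
    }

If that returns no pods, or the pod is stuck Pending, that points at the scheduling/contention scenario above.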

I've actually just deployed v1.13.4, so I'll keep an eye out! Thanks

⚠️COMMENT VISIBILITY WARNING⚠️

Comments on closed issues are hard for our team to see.
If you need more assistance, please open a new issue that references this one.
If you wish to keep having a conversation with other community members under this issue, feel free to do so.

@alam0rt were you able to resolve this?