kubernetes/kops

container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni plugin not initialized

BartoszZawadzki opened this issue

/kind bug

1. What kops version are you running? The command kops version will display this information.

Client version: 1.24.5 (git-v1.24.5)

2. What Kubernetes version are you running? kubectl version will print the
version if a cluster is running or provide the Kubernetes version specified as
a kops flag.

Server Version: v1.24.17

3. What cloud provider are you using?

AWS

4. What commands did you run? What is the simplest way to reproduce this issue?

kops validate cluster

5. What happened after the commands executed?

Validating cluster dev.k8s.sgr-cloud.sh

INSTANCE GROUPS
NAME			ROLE	MACHINETYPE	MIN	MAX	SUBNETS
app-arm-eu-west-1a	Node	t4g.large	0	15	dev.k8s.sgr-cloud.sh-private-eu-west-1a
app-arm-eu-west-1b	Node	t4g.large	0	15	dev.k8s.sgr-cloud.sh-private-eu-west-1b
app-arm-eu-west-1c	Node	t4g.large	0	15	dev.k8s.sgr-cloud.sh-private-eu-west-1c
app-eu-west-1a		Node	t3a.2xlarge	0	15	dev.k8s.sgr-cloud.sh-private-eu-west-1a
app-eu-west-1b		Node	t3a.2xlarge	0	15	dev.k8s.sgr-cloud.sh-private-eu-west-1b
app-eu-west-1c		Node	t3a.2xlarge	0	15	dev.k8s.sgr-cloud.sh-private-eu-west-1c
bastions		Bastion	t3.small	1	1	dev.k8s.sgr-cloud.sh-public-eu-west-1a,dev.k8s.sgr-cloud.sh-public-eu-west-1b,dev.k8s.sgr-cloud.sh-public-eu-west-1c
ci-eu-west-1a		Node	t3a.2xlarge	0	10	dev.k8s.sgr-cloud.sh-private-eu-west-1a
gpu-eu-west-1a		Node	g5.2xlarge	0	15	dev.k8s.sgr-cloud.sh-private-eu-west-1a
gpu-eu-west-1b		Node	g5.2xlarge	0	15	dev.k8s.sgr-cloud.sh-private-eu-west-1b
gpu-eu-west-1c		Node	g5.2xlarge	0	15	dev.k8s.sgr-cloud.sh-private-eu-west-1c
infra-eu-west-1a	Node	t3a.2xlarge	1	3	dev.k8s.sgr-cloud.sh-private-eu-west-1a
infra-eu-west-1b	Node	t3a.2xlarge	1	3	dev.k8s.sgr-cloud.sh-private-eu-west-1b
infra-eu-west-1c	Node	t3a.2xlarge	1	3	dev.k8s.sgr-cloud.sh-private-eu-west-1c
master-eu-west-1a	Master	t3a.xlarge	1	2	dev.k8s.sgr-cloud.sh-private-eu-west-1a
master-eu-west-1b	Master	t3a.xlarge	1	2	dev.k8s.sgr-cloud.sh-private-eu-west-1b
master-eu-west-1c	Master	t3a.xlarge	1	2	dev.k8s.sgr-cloud.sh-private-eu-west-1c

NODE STATUS
NAME			ROLE	READY
i-0432daf1eb766a0df	node	False
i-0845aa28b742d61ce	master	False
i-0c4cd91b01124b439	master	False

VALIDATION ERRORS
KIND	NAME			MESSAGE
Machine	i-02409274b530a2a9b	machine "i-02409274b530a2a9b" has not yet joined cluster
Machine	i-02cb1df09a9ad18eb	machine "i-02cb1df09a9ad18eb" has not yet joined cluster
Machine	i-02f6d0de212e95cd8	machine "i-02f6d0de212e95cd8" has not yet joined cluster
Machine	i-03a2b2086edb7c04f	machine "i-03a2b2086edb7c04f" has not yet joined cluster
Machine	i-0750fc13240a1869b	machine "i-0750fc13240a1869b" has not yet joined cluster
Machine	i-0773e1466ae9be609	machine "i-0773e1466ae9be609" has not yet joined cluster
Machine	i-087147df3d0c7dfd8	machine "i-087147df3d0c7dfd8" has not yet joined cluster
Machine	i-0a5ac944ae1926d6f	machine "i-0a5ac944ae1926d6f" has not yet joined cluster
Machine	i-0c6c938c57e1061fe	machine "i-0c6c938c57e1061fe" has not yet joined cluster
Machine	i-0ef86b3f204d977f3	machine "i-0ef86b3f204d977f3" has not yet joined cluster
Machine	i-0fea2dbdbfd81409f	machine "i-0fea2dbdbfd81409f" has not yet joined cluster
Node	i-0432daf1eb766a0df	node "i-0432daf1eb766a0df" of role "node" is not ready
Node	i-0845aa28b742d61ce	node "i-0845aa28b742d61ce" of role "master" is not ready
Node	i-0c4cd91b01124b439	node "i-0c4cd91b01124b439" of role "master" is not ready

Validation Failed
Error: Validation failed: cluster not yet healthy
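
As a side note, validation can also be told to poll until the cluster converges instead of exiting on the first failure:

kops validate cluster --wait 10m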

6. What did you expect to happen?

I expected validation to pass.

7. Please provide your cluster manifest. Execute
kops get --name my.example.com -o yaml to display your cluster manifest.
You may want to remove your cluster name and other sensitive information.

8. Please run the commands with the most verbose logging by adding the -v 10 flag.
Paste the logs into this report, or in a gist and provide the gist link here.

9. Anything else we need to know?

After describing the three nodes (two master nodes and one worker node), I noticed that they all show the same error:

Conditions:
  Type             Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----             ------  -----------------                 ------------------                ------                       -------
  MemoryPressure   False   Tue, 09 Apr 2024 13:27:47 +0200   Mon, 08 Apr 2024 16:16:07 +0200   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure     False   Tue, 09 Apr 2024 13:27:47 +0200   Mon, 08 Apr 2024 16:16:07 +0200   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure      False   Tue, 09 Apr 2024 13:27:47 +0200   Mon, 08 Apr 2024 16:16:07 +0200   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready            False   Tue, 09 Apr 2024 13:27:47 +0200   Mon, 08 Apr 2024 16:16:07 +0200   KubeletNotReady              container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni plugin not initialized
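
The conditions above come from kubectl describe node. To check the CNI state directly on an affected node, something along these lines should work (assuming crictl is available and the containerd-default CNI config path; adjust for your runtime):

# On the affected node:
ls /etc/cni/net.d/                          # no config files here means the CNI plugin never initialized
sudo crictl info | grep -A 3 NetworkReady   # the runtime's own view of the network plugin status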

I've followed https://kops.sigs.k8s.io/operations/troubleshoot/:

  1. Nodeup doesn't show any issues:
Apr 08 14:56:10 ip-172-20-91-12 nodeup[1380]: success
Apr 08 14:56:10 ip-172-20-91-12 systemd[1]: kops-configuration.service: Succeeded.
Apr 08 14:56:10 ip-172-20-91-12 systemd[1]: Finished Run kOps bootstrap (nodeup).
  2. kube-apiserver shows multiple errors (log file attached)
  3. Both etcd.log and etcd-events.log don't show any errors
  4. kubelet shows multiple errors:
"MESSAGE" : "E0409 10:41:11.892153    3762 kubelet.go:2352] \"Container runtime network not ready\" networkReady=\"NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni plugin not initialized\"",

"MESSAGE" : "E0409 10:41:12.275970    3762 kubelet.go:1693] \"Failed creating a mirror pod for\" err=\"Internal error occurred: failed calling webhook \\\"pod-identity-webhook.amazonaws.com\\\": failed to call webh
ook: Post \\\"https://pod-identity-webhook.kube-system.svc:443/mutate?timeout=10s\\\": no endpoints available for service \\\"pod-identity-webhook\\\"\" pod=\"kube-system/etcd-manager-main-i-0845aa28b742d61ce\"",
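
The "no endpoints available" part can be verified directly; the resource names below are taken from the error message above:

kubectl -n kube-system get endpoints pod-identity-webhook
kubectl get mutatingwebhookconfiguration pod-identity-webhook \
  -o jsonpath='{.webhooks[0].failurePolicy}'

An empty ENDPOINTS column combined with failurePolicy: Fail means every Pod that matches the webhook gets rejected.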

This happened after performing kops rolling-update cluster --cloudonly; before that, the cluster was healthy.

I've also SSH'd into one of the worker nodes that did not join the cluster and noticed that it failed at nodeup:

Apr 09 12:58:42 ip-172-20-105-158 nodeup[54814]: I0409 12:58:42.014417   54814 executor.go:155] No progress made, sleeping before retrying 1 task(s)
Apr 09 12:58:52 ip-172-20-105-158 nodeup[54814]: I0409 12:58:52.023456   54814 executor.go:111] Tasks: 77 done / 85 total; 1 can run
Apr 09 12:58:52 ip-172-20-105-158 nodeup[54814]: I0409 12:58:52.023510   54814 executor.go:186] Executing task "BootstrapClientTask/BootstrapClient": BootstrapClientTask
Apr 09 12:58:55 ip-172-20-105-158 nodeup[54814]: W0409 12:58:55.102587   54814 executor.go:139] error running task "BootstrapClientTask/BootstrapClient" (1m38s remaining to succeed): Post "https://kops-controller.internal.dev.k8s.sgr-clXXX.XX:3988/bootstrap": dial tcp 172.20.103.227:3988: connect: no route to host

From what I can see here (https://kops.sigs.k8s.io/contributing/ports/), 3988 is the kops-controller serving port.
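
A quick way to reproduce that failure outside of nodeup is a raw reachability check from the stuck node, using the IP and port from the log line above (nc may need to be installed first):

# From the node that failed to bootstrap:
nc -zv 172.20.103.227 3988
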
This is what I get when trying to fetch logs from the kops-controller Pod in the kube-system namespace:

Error from server: Get "https://172.20.103.227:10250/containerLogs/kube-system/kops-controller-p9zf5/kops-controller": dial tcp 172.20.103.227:10250: connect: no route to host

As it turned out, the problem was with the pod-identity-webhook mutatingwebhookconfiguration (mutatingwebhookconfigurations.admissionregistration.k8s.io): it had failurePolicy: Fail, and because we did kops rolling-update --cloudonly, other Pods didn't pass that webhook.

The reason they didn't pass is that the pod-identity-webhook itself wasn't yet up and running.

I edited the webhook and set failurePolicy: Ignore, waited a bit for all the Pods to get Running, and then reverted the webhook to failurePolicy: Fail.
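
For anyone hitting the same chicken-and-egg situation, a minimal sketch of that workaround (assuming, as in this cluster, the configuration is named pod-identity-webhook and the relevant webhook is the first entry in .webhooks):

# Temporarily admit Pods even while the webhook backend is down:
kubectl patch mutatingwebhookconfiguration pod-identity-webhook \
  --type=json -p='[{"op":"replace","path":"/webhooks/0/failurePolicy","value":"Ignore"}]'

# Once pod-identity-webhook and the rest of kube-system are Running, revert:
kubectl patch mutatingwebhookconfiguration pod-identity-webhook \
  --type=json -p='[{"op":"replace","path":"/webhooks/0/failurePolicy","value":"Fail"}]'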