Nodes not ready on raspberrypi

Question

Closed this issue 6 years ago · 13 comments

hkoessler commented 7 years ago

OS running on Ansible host:

Linux Mint 18

Ansible Version (`ansible --version`):

2.5.1

Uploaded logs showing errors(`rak8s/.log/ansible.log`)

n/a

Raspberry Pi Hardware Version:

Raspi 3 B

Raspberry Pi OS & Version (`cat /etc/os-release`):

PRETTY_NAME="Raspbian GNU/Linux 9 (stretch)"
NAME="Raspbian GNU/Linux"
VERSION_ID="9"
VERSION="9 (stretch)"
ID=raspbian
ID_LIKE=debian

Detailed description of the issue:

I've set up 3 Raspberry Pis with the 2018-03-13-raspbian-stretch-lite.img image
and the Ansible scripts at tag 0.1.5 of this repo.
After a few reboots kubectl works, but
"sudo kubectl get nodes"
reports the master node and one worker node as NotReady.
On the master node, "kubectl describe node ..." reports the following:

runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized.

After that I ran "sudo kubeadm init" directly on the master node (just to see what happens) and got:

WARNING: [init] Using Kubernetes version: v1.10.1
[init] Using Authorization modes: [Node RBAC]
[preflight] Running pre-flight checks.
[WARNING SystemVerification]: docker version is greater than the most recently validated version. Docker version: 18.04.0-ce. Max validated version: 17.03
[WARNING FileExisting-crictl]: crictl not found in system path
Suggestion: go get github.com/kubernetes-incubator/cri-tools/cmd/crictl
[preflight] Some fatal errors occurred:
[ERROR Port-6443]: Port 6443 is in use
[ERROR Port-10250]: Port 10250 is in use
[ERROR Port-10251]: Port 10251 is in use
[ERROR Port-10252]: Port 10252 is in use
[ERROR FileAvailable--etc-kubernetes-manifests-kube-apiserver.yaml]: /etc/kubernetes/manifests/kube-apiserver.yaml already exists
[ERROR FileAvailable--etc-kubernetes-manifests-kube-controller-manager.yaml]: /etc/kubernetes/manifests/kube-controller-manager.yaml already exists
[ERROR FileAvailable--etc-kubernetes-manifests-kube-scheduler.yaml]: /etc/kubernetes/manifests/kube-scheduler.yaml already exists
[ERROR FileAvailable--etc-kubernetes-manifests-etcd.yaml]: /etc/kubernetes/manifests/etcd.yaml already exists
[ERROR Port-2379]: Port 2379 is in use
[ERROR DirAvailable--var-lib-etcd]: /var/lib/etcd is not empty
CPU hardcapping unsupported

By the way: during the setup with the Ansible scripts I also hit bug #26.
I then executed the relevant command directly on my master node, and that succeeded.
After that I was able to rerun the playbook.
I am running the playbook from a laptop "outside" the Raspberry Pis.
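
The kubelet message "cni config uninitialized" generally means that no CNI network configuration has been written on that node yet. A minimal sketch of a check for that, assuming the configuration directory is /etc/cni/net.d (the directory that the kubeadm reset output later in this thread lists); the function name cni_config_count is made up for illustration:

```shell
# Hypothetical helper: count the CNI config files in a directory.
# Assumption: the kubelet reads its CNI config from /etc/cni/net.d.
cni_config_count() {
    dir="${1:-/etc/cni/net.d}"
    count=0
    for f in "$dir"/*.conf "$dir"/*.conflist "$dir"/*.json; do
        # An unmatched glob stays literal, so confirm the file really exists.
        if [ -f "$f" ]; then
            count=$((count + 1))
        fi
    done
    echo "$count"
}

# Usage on a node: cni_config_count
```

If this prints 0 on a NotReady node, that would suggest the network add-on has not written its configuration yet, rather than a problem in the kubelet itself.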

Answer 1 · 2018-04-26T13:02:59.000Z

Try the latest release and let me know how that goes, please: https://github.com/rak8s/rak8s/releases/tag/v0.2.0

Answer 2 · 2018-05-01T21:13:54.000Z

I tried to install with two freshly flashed 2018-03-13-raspbian-stretch-lite.img images on my Raspberry Pis, named raspic0 and raspic1. I now use Ansible 2.5.2 on Linux Mint 18.2.

"ansible-playbook cluster.yml" results in

/ TASK [common : Pass bridged IPv4 traffic to iptables'
\ chains] /

    \   ^__^
     \  (oo)\_______
        (__)\       )\/\
            ||----w |
            ||     ||

fatal: [raspic1]: FAILED! => {"changed": false, "msg": "Failed to reload sysctl: sysctl: cannot stat /proc/sys/net/bridge/bridge-nf-call-iptables: No such file or directory\n"}
fatal: [raspic0]: FAILED! => {"changed": false, "msg": "Failed to reload sysctl: sysctl: cannot stat /proc/sys/net/bridge/bridge-nf-call-iptables: No such file or directory\n"}

After that I stopped ansible-playbook with CTRL+C
and reran it.

The task [common : Pass bridged IPv4 traffic to iptables' chains] succeeded this time,
but the run then failed at the Run Docker Install Script task with

< TASK [kubeadm : Run Docker Install Script] >

    \   ^__^
     \  (oo)\_______
        (__)\       )\/\
            ||----w |
            ||     ||

fatal: [raspic1]: FAILED! => {"changed": true, "msg": "non-zero return code", "rc": 100, "stderr": "Shared connection to raspic1 closed.\r\n", "stdout": "# Executing docker install script, commit: 1d31602\r\n+ sh -c apt-get update -qq >/dev/null\r\n+ sh -c apt-get install -y -qq apt-transport-https ca-certificates curl >/dev/null\r\n+ sh -c curl -fsSL "https://download.docker.com/linux/raspbian/gpg\" | apt-key add -qq - >/dev/null\r\nWarning: apt-key output should not be parsed (stdout is not a terminal)\r\n+ sh -c echo "deb [arch=armhf] https://download.docker.com/linux/raspbian stretch edge" > /etc/apt/sources.list.d/docker.list\r\n+ [ raspbian = debian ]\r\n+ sh -c apt-get update -qq >/dev/null\r\n+ sh -c apt-get install -y -qq --no-install-recommends docker-ce >/dev/null\r\nE: Sub-process /usr/bin/dpkg returned an error code (1)\r\n", "stdout_lines": ["# Executing docker install script, commit: 1d31602", "+ sh -c apt-get update -qq >/dev/null", "+ sh -c apt-get install -y -qq apt-transport-https ca-certificates curl >/dev/null", "+ sh -c curl -fsSL "https://download.docker.com/linux/raspbian/gpg" | apt-key add -qq - >/dev/null", "Warning: apt-key output should not be parsed (stdout is not a terminal)", "+ sh -c echo "deb [arch=armhf] https://download.docker.com/linux/raspbian stretch edge" > /etc/apt/sources.list.d/docker.list", "+ [ raspbian = debian ]", "+ sh -c apt-get update -qq >/dev/null", "+ sh -c apt-get install -y -qq --no-install-recommends docker-ce >/dev/null", "E: Sub-process /usr/bin/dpkg returned an error code (1)"]}
fatal: [raspic0]: FAILED! => {"changed": true, "msg": "non-zero return code", "rc": 100, "stderr": "Shared connection to raspic0 closed.\r\n", "stdout": "# Executing docker install script, commit: 1d31602\r\n+ sh -c apt-get update -qq >/dev/null\r\n+ sh -c apt-get install -y -qq apt-transport-https ca-certificates curl >/dev/null\r\n+ sh -c curl -fsSL "https://download.docker.com/linux/raspbian/gpg\" | apt-key add -qq - >/dev/null\r\nWarning: apt-key output should not be parsed (stdout is not a terminal)\r\n+ sh -c echo "deb [arch=armhf] https://download.docker.com/linux/raspbian stretch edge" > /etc/apt/sources.list.d/docker.list\r\n+ [ raspbian = debian ]\r\n+ sh -c apt-get update -qq >/dev/null\r\n+ sh -c apt-get install -y -qq --no-install-recommends docker-ce >/dev/null\r\nE: Sub-process /usr/bin/dpkg returned an error code (1)\r\n", "stdout_lines": ["# Executing docker install script, commit: 1d31602", "+ sh -c apt-get update -qq >/dev/null", "+ sh -c apt-get install -y -qq apt-transport-https ca-certificates curl >/dev/null", "+ sh -c curl -fsSL "https://download.docker.com/linux/raspbian/gpg" | apt-key add -qq - >/dev/null", "Warning: apt-key output should not be parsed (stdout is not a terminal)", "+ sh -c echo "deb [arch=armhf] https://download.docker.com/linux/raspbian stretch edge" > /etc/apt/sources.list.d/docker.list", "+ [ raspbian = debian ]", "+ sh -c apt-get update -qq >/dev/null", "+ sh -c apt-get install -y -qq --no-install-recommends docker-ce >/dev/null", "E: Sub-process /usr/bin/dpkg returned an error code (1)"]}
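
Regarding the first failure above ("cannot stat /proc/sys/net/bridge/bridge-nf-call-iptables"): that entry only exists while the br_netfilter kernel module is loaded, so a likely cause (not confirmed in this thread) is that the module was not loaded yet when the task ran. A minimal sketch of a pre-check, with a hypothetical function name:

```shell
# Hypothetical helper: report whether the bridge netfilter sysctl entry exists.
# The optional argument overrides the path, which is only useful for testing.
check_bridge_sysctl() {
    path="${1:-/proc/sys/net/bridge/bridge-nf-call-iptables}"
    if [ -e "$path" ]; then
        echo "ready"
    else
        echo "missing: try 'sudo modprobe br_netfilter', then rerun the playbook"
        return 1
    fi
}

# Usage on a node: check_bridge_sysctl
```

Loading the module by hand does not survive a reboot; whether the playbook should load it itself is a separate question for the maintainers.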

Answer 3 · 2018-05-01T22:00:21.000Z

After that I rebooted raspic0 and raspic1 manually.
A new "ansible-playbook cluster.yml" run then succeeded.

But the nodes are still NotReady.
If I log into the master (which is my raspic0), "sudo kubectl get nodes" responds with

NAME      STATUS     ROLES    AGE   VERSION
raspic0   NotReady   master   6m    v1.10.2
raspic1   NotReady   <none>   5m    v1.10.2

Additionally, the master still shows the following message when queried with "sudo kubectl describe node raspic0":

Conditions:
  Type            Status  LastHeartbeatTime                LastTransitionTime               Reason                      Message
  OutOfDisk       False   Tue, 01 May 2018 21:59:06 +0000  Tue, 01 May 2018 21:52:12 +0000  KubeletHasSufficientDisk    kubelet has sufficient disk space available
  MemoryPressure  False   Tue, 01 May 2018 21:59:06 +0000  Tue, 01 May 2018 21:52:12 +0000  KubeletHasSufficientMemory  kubelet has sufficient memory available
  DiskPressure    False   Tue, 01 May 2018 21:59:06 +0000  Tue, 01 May 2018 21:52:12 +0000  KubeletHasNoDiskPressure    kubelet has no disk pressure
  PIDPressure     False   Tue, 01 May 2018 21:59:06 +0000  Tue, 01 May 2018 21:52:12 +0000  KubeletHasSufficientPID     kubelet has sufficient PID available
  Ready           False   Tue, 01 May 2018 21:59:06 +0000  Tue, 01 May 2018 21:52:12 +0000  KubeletNotReady             runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized. WARNING: CPU hardcapping unsupported

Answer 4 · 2018-05-01T22:01:46.000Z

That means that the latest master version didn't fix that problem.

Answer 5 · 2018-05-02T09:10:33.000Z

I have had this occasionally. Just reboot the cluster and it will work.

Answer 6 · 2018-05-02T12:59:55.000Z

One of the things we definitely need to work on is reliability and consistency. Sadly, I don't have a cluster to test with and the cluster I created with rak8s is actively in use doing things here for me. Pretty sure I'd be accosted for buying a three node cluster for testing.

Chris Short
https://chrisshort.net
https://devopsish.com

Answer 7 · 2018-05-02T13:54:40.000Z

To investigate this issue I deployed a fresh single-node cluster five times this afternoon. Once it failed on kubeadm init (a timeout, probably caused by slow internet). In all other attempts the playbook finished successfully, but on one occasion the node was 'not ready'. In that case 'systemctl status kubelet' reported 'network plugin is not ready: cni config uninitialized.' The issue was resolved after a reboot, just as I had seen before.

One thing I noticed: when I delete an existing cluster using 'kubeadm reset', I need to reboot the node (for this test I used only one node); otherwise a clean install (using 'ansible-playbook cluster.yml') ends up with the node 'not ready' due to 'network plugin is not ready: cni config uninitialized.' In the case where my test failed, I had forgotten to reboot the node after manually removing the cluster via 'kubeadm reset'.

If you remove a cluster via "kubeadm reset", "ip add" still displays all the Kubernetes weave networks:

````
$ kubeadm reset
[preflight] Running pre-flight checks.
[reset] Stopping the kubelet service.
[reset] Unmounting mounted directories in "/var/lib/kubelet"
[reset] Removing kubernetes-managed containers.
[reset] Deleting contents of stateful directories: [/var/lib/kubelet /etc/cni/net.d /var/lib/dockershim /var/run/kubernetes /var/lib/etcd]
[reset] Deleting contents of config directories: [/etc/kubernetes/manifests /etc/kubernetes/pki]
[reset] Deleting files: [/etc/kubernetes/admin.conf /etc/kubernetes/kubelet.conf /etc/kubernetes/bootstrap-kubelet.conf /etc/kubernetes/controller-manager.conf /etc/kubernetes/scheduler.conf]
$ ip address
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: eth0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc pfifo_fast state DOWN group default qlen 1000
    link/ether b8:27:eb:cf:d0:f3 brd ff:ff:ff:ff:ff:ff
3: wlan0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
    link/ether b8:27:eb:9a:85:a6 brd ff:ff:ff:ff:ff:ff
    inet 192.168.51.67/24 brd 192.168.51.255 scope global wlan0
       valid_lft forever preferred_lft forever
    inet6 fe80::8fe7:a366:5c7b:1cd0/64 scope link
       valid_lft forever preferred_lft forever
4: docker0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN group default
    link/ether 02:42:3c:da:1b:3a brd ff:ff:ff:ff:ff:ff
    inet 172.17.0.1/16 brd 172.17.255.255 scope global docker0
       valid_lft forever preferred_lft forever
5: datapath: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1376 qdisc noqueue state UNKNOWN group default qlen 1000
    link/ether 96:6c:d8:62:af:8f brd ff:ff:ff:ff:ff:ff
    inet 169.254.148.180/16 brd 169.254.255.255 scope global datapath
       valid_lft forever preferred_lft forever
    inet6 fe80::c89b:530d:49c3:5f8c/64 scope link
       valid_lft forever preferred_lft forever
7: weave: <NO-CARRIER,BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1376 qdisc noqueue state DORMANT group default qlen 1000
    link/ether 12:dd:ea:1e:56:87 brd ff:ff:ff:ff:ff:ff
    inet 10.32.0.1/12 brd 10.47.255.255 scope global weave
       valid_lft forever preferred_lft forever
8: dummy0: <BROADCAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default qlen 1000
    link/ether 66:1c:af:dc:30:32 brd ff:ff:ff:ff:ff:ff
    inet6 fe80::323d:1767:5891:a52b/64 scope link
       valid_lft forever preferred_lft forever
10: vethwe-datapath@vethwe-bridge: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1376 qdisc noqueue master datapath state UP group default
    link/ether ca:10:1f:89:5d:d5 brd ff:ff:ff:ff:ff:ff
    inet 169.254.205.166/16 brd 169.254.255.255 scope global vethwe-datapath
       valid_lft forever preferred_lft forever
    inet6 fe80::c810:1fff:fe89:5dd5/64 scope link
       valid_lft forever preferred_lft forever
11: vethwe-bridge@vethwe-datapath: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1376 qdisc noqueue master weave state UP group default
    link/ether 3e:ad:99:80:83:c6 brd ff:ff:ff:ff:ff:ff
    inet 169.254.62.104/16 brd 169.254.255.255 scope global vethwe-bridge
       valid_lft forever preferred_lft forever
    inet6 fe80::3cad:99ff:fe80:83c6/64 scope link
       valid_lft forever preferred_lft forever
12: vxlan-6784: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 65535 qdisc noqueue master datapath state UNKNOWN group default qlen 1000
    link/ether b2:7b:14:3c:1d:d6 brd ff:ff:ff:ff:ff:ff
    inet 169.254.5.32/16 brd 169.254.255.255 scope global vxlan-6784
       valid_lft forever preferred_lft forever
    inet6 fe80::b07b:14ff:fe3c:1dd6/64 scope link
       valid_lft forever preferred_lft forever
````

After a reboot those networks will be gone:

````
$ ip address
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: eth0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc pfifo_fast state DOWN group default qlen 1000
    link/ether b8:27:eb:cf:d0:f3 brd ff:ff:ff:ff:ff:ff
3: wlan0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
    link/ether b8:27:eb:9a:85:a6 brd ff:ff:ff:ff:ff:ff
    inet 192.168.51.67/24 brd 192.168.51.255 scope global wlan0
       valid_lft forever preferred_lft forever
    inet6 fe80::8fe7:a366:5c7b:1cd0/64 scope link
       valid_lft forever preferred_lft forever
4: docker0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN group default
    link/ether 02:42:3e:39:3a:54 brd ff:ff:ff:ff:ff:ff
    inet 172.17.0.1/16 brd 172.17.255.255 scope global docker0
       valid_lft forever preferred_lft forever
````

So my advice is to reboot the nodes before you do a second install, or whenever you run into the "node not ready" due to "network plugin is not ready: cni config uninitialized" issue.
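
The "reboot before reinstalling" advice above could be turned into a quick check. This is only a sketch: the function name is hypothetical, and the name patterns (weave, datapath, vethwe-*, vxlan-*) are taken from the interface listing in this comment rather than from any Weave documentation, so other setups may need different patterns.

```shell
# Hypothetical helper: read `ip address` (or `ip link`) output on stdin and
# print the names of interfaces that look like Weave leftovers.
leftover_cni_links() {
    awk -F': ' '/^[0-9]+: / {
        name = $2
        sub(/@.*/, "", name)   # drop the "@peer" suffix shown for veth pairs
        if (name ~ /^(weave|datapath|vethwe-|vxlan-)/) print name
    }'
}

# Usage on a node: ip address | leftover_cni_links
```

If this prints anything after a kubeadm reset, the node probably still needs the reboot described above before the playbook is run again.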

Answer 8 · 2018-05-02T15:43:45.000Z

If you could put that on the discourse site that'd be amazing! https://discourse.rak8s.io/

Chris Short
https://chrisshort.net
https://devopsish.com

Answer 9 · 2018-05-06T18:23:47.000Z

Quote Chris-short: One of the things we definitely need to work on is reliability and
consistency. Sadly, I don't have a cluster to test with and the cluster I
created with rak8s is actively in use doing things here for me. Pretty sure
I'd be accosted for buying a three node cluster for testing.

Yes, testing is definitely important. I am looking into the Kubernetes end-to-end tests and will try to set them up. I can run tests on a 3-node cluster.

Answer 10 · 2018-07-10T21:17:19.000Z

Hi Chris-short, hi tedsluis: I used the latest version of the development branch and now all the nodes are running.
But if I try to deploy a pod/service, it doesn't respond.

For example, the commands:
$ sudo kubectl run hypriot --image=hypriot/rpi-busybox-httpd --replicas=3 --port=80
$ sudo kubectl get endpoints hypriot

results in the following:
$ sudo kubectl get endpoints hypriot
NAME      ENDPOINTS                                   AGE
hypriot   172.30.1.6:80,172.30.2.4:80,172.30.3.5:80   11m

but "curl 172.30.1.6:80" doesn't give a result.

On the other hand checking the services results in
$ sudo kubectl get svc
NAME      TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)   AGE
hypriot   ClusterIP   10.99.49.200   <none>        80/TCP    13m

That means that the cluster IP is in a completely different network than the endpoints.
Is that normal behavior?
What should I check to find out why the container doesn't respond to curl?
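
On the question of the two address ranges: Service ClusterIPs and pod endpoint IPs are normally allocated from two separate ranges, so the mismatch by itself is expected. A small sketch that checks an address against a range, assuming kubeadm's default service subnet 10.96.0.0/12 (this thread does not show which subnet the cluster was actually initialised with); both helper names are hypothetical:

```shell
# Hypothetical helper: convert a dotted-quad IPv4 address to an integer.
ip_to_int() {
    old_ifs=$IFS
    IFS=.
    set -- $1            # split the address into its four octets
    IFS=$old_ifs
    echo $(( ($1 << 24) + ($2 << 16) + ($3 << 8) + $4 ))
}

# Hypothetical helper: succeed if address $1 lies inside CIDR range $2.
ip_in_cidr() {
    ip=$(ip_to_int "$1")
    net=$(ip_to_int "${2%/*}")
    bits=${2#*/}
    mask=$(( (4294967295 << (32 - bits)) & 4294967295 ))
    [ $(( ip & mask )) -eq $(( net & mask )) ]
}

# Assumed default service subnet; the pod IP below comes from the endpoints list.
ip_in_cidr 10.99.49.200 10.96.0.0/12 && echo "ClusterIP is in the service range"
ip_in_cidr 172.30.1.6 10.96.0.0/12 || echo "pod IP is outside the service range"
```

Under that assumption, 10.99.49.200 falls inside the service range while the 172.30.x.x endpoints belong to the pod network, so the unanswered curl would have to be explained by something else.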

Answer 11 · 2018-11-02T03:44:06.000Z

I'm going to close this. Please grab the latest version and try again. If there are bugs, please submit them.

Answer 12 · 2018-11-20T19:52:07.000Z

I had the same issue: the nodes are always "NotReady", and I am on the latest version, 1.12.2. Any help?

Answer 13 · 2018-11-20T20:13:53.000Z

Try an older version. https://github.com/rak8s/rak8s/tree/v0.2.1
