networkop/meshnet-cni

Cluster data plane fails after initial deploy

Cerebus opened this issue · 18 comments

Conditions:

  • New kind cluster with kindnet
  • meshnet-cni @v0.3.0 installed

Intermittently, Pods deployed immediately after meshnet is installed come up with the cluster network unavailable. For example, kube-prometheus-stack initializes with a Job, but the Job fails to talk to the API server:

> kubectl -n mimesis-data logs mimesis-mon-mimesis-data-admission-create-cslj5        
W0829 17:07:15.999396       1 client_config.go:608] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
{"err":"Get \"https://10.96.0.1:443/api/v1/namespaces/mimesis-data/secrets/mimesis-mon-mimesis-data-admission\": dial tcp 10.96.0.1:443: connect: no route to host","level":"fatal","msg":"error getting secret","source":"k8s/k8s.go:109","time":"2021-08-29T17:07:19Z"}

When this condition occurs, it affects all Pods. If I exec into a Pod and ping cluster-CIDR addresses, they all return no route to host.

I can sometimes kick networking over by generating some external network traffic (e.g., apt-get update from the kindnet pod).

Thanks for reporting this @Cerebus. Seems like the eth0 interface may not have been set up correctly.
When this does happen, is this issue persistent or does it only affect Pods that were deployed immediately after?
Can you document the steps to reproduce this?
and if you happen to catch this again, can you collect the output of ip addr && ip route inside a Pod and journalctl logs from one of the kind nodes (something like docker exec journalctl -xn --no-pager)?
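For example (substitute the Pod and the kind node's docker container name, which matches the node name):

kubectl -n <namespace> exec <pod> -- sh -c 'ip addr && ip route'
docker exec <kind-node> journalctl -xn --no-pager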

Reproduce:

  • 3 pods in a triangle running alpine:latest
  • All links are /31 addresses allocated from 192.168.0.0/16 (see the Topology sketch below)
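A per-pod Topology manifest then looks roughly like this (sketched from memory of meshnet's Topology CRD, shown for n1 only; the peers' interface names are assumptions):

apiVersion: networkop.co.uk/v1beta1
kind: Topology
metadata:
  name: n1
  namespace: example
spec:
  links:
  - uid: 1
    peer_pod: n2
    local_intf: eth1
    local_ip: 192.168.0.0/31
    peer_intf: eth1
    peer_ip: 192.168.0.1/31
  - uid: 2
    peer_pod: n3
    local_intf: eth2
    local_ip: 192.168.0.2/31
    peer_intf: eth1
    peer_ip: 192.168.0.3/31

ip addr and ip route from n1: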
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
3: eth0@if5: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default 
    link/ether 1a:94:36:86:62:81 brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 10.244.0.2/24 brd 10.244.0.255 scope global eth0
       valid_lft forever preferred_lft forever
    inet6 fe80::1894:36ff:fe86:6281/64 scope link 
       valid_lft forever preferred_lft forever
8: eth1@if9: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default 
    link/ether 22:09:dd:c0:5b:27 brd ff:ff:ff:ff:ff:ff link-netnsid 1
    inet 192.168.0.0/31 scope global eth1
       valid_lft forever preferred_lft forever
    inet6 fe80::2009:ddff:fec0:5b27/64 scope link 
       valid_lft forever preferred_lft forever
10: eth2@if11: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default 
    link/ether 7e:e2:dc:25:db:6e brd ff:ff:ff:ff:ff:ff link-netnsid 2
    inet 192.168.0.2/31 scope global eth2
       valid_lft forever preferred_lft forever
    inet6 fe80::7ce2:dcff:fe25:db6e/64 scope link 
       valid_lft forever preferred_lft forever
10.244.0.0/24 via 10.244.0.1 dev eth0 src 10.244.0.2 
10.244.0.1 dev eth0 scope link src 10.244.0.2 
192.168.0.0/31 dev eth1 proto kernel scope link src 192.168.0.0 
192.168.0.2/31 dev eth2 proto kernel scope link src 192.168.0.2 
192.168.0.4/31 proto ospf metric 20 
	nexthop via 192.168.0.1 dev eth1 weight 1 
	nexthop via 192.168.0.3 dev eth2 weight 1 

[deleted pinging coredns b/c I realized that's blocked anyway]

kubectl -n example get pod -o wide
NAME   READY   STATUS    RESTARTS   AGE   IP           NODE                         NOMINATED NODE   READINESS GATES
n1     1/2     Running   9          19m   10.244.0.2   mimesis-demo-control-plane   <none>           <none>
n2     1/2     Running   9          19m   10.244.0.3   mimesis-demo-control-plane   <none>           <none>
n3     1/2     Running   9          19m   10.244.0.4   mimesis-demo-control-plane   <none>           <none>
kubectl -n example exec n1 -- ping -c 1 -W 1 10.244.0.3
Defaulting container name to workload.
Use 'kubectl describe pod/n1 -n example' to see all of the containers in this pod.
PING 10.244.0.3 (10.244.0.3): 56 data bytes

--- 10.244.0.3 ping statistics ---
1 packets transmitted, 0 packets received, 100% packet loss
command terminated with exit code 1

Journalctl returns a lot, natch; is there something in particular you're looking for?

Something caused the sidecars to crash, so I redeployed and now meshnet is working:

mimesis > kubectl -n example exec n1 -- ping 10.244.0.1
Defaulting container name to workload.
Use 'kubectl describe pod/n1 -n example' to see all of the containers in this pod.
PING 10.244.0.1 (10.244.0.1): 56 data bytes
64 bytes from 10.244.0.1: seq=0 ttl=64 time=0.123 ms
64 bytes from 10.244.0.1: seq=1 ttl=64 time=0.216 ms

I am seeing the same issue as well

marcus@muerto:~/go/src/github.com/google/kne/kne_cli$ kubectl exec -it foo -- /bin/bash
root@foo:/# kubectl get pods
Unable to connect to the server: dial tcp 10.96.0.1:443: connect: no route to host
root@foo:/# ip route
default via 10.244.0.1 dev eth0
10.244.0.0/24 via 10.244.0.1 dev eth0 src 10.244.0.2
10.244.0.1 dev eth0 scope link src 10.244.0.2
root@foo:/# ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: tunl0@NONE: <NOARP> mtu 1480 qdisc noop state DOWN group default qlen 1000
    link/ipip 0.0.0.0 brd 0.0.0.0
3: sit0@NONE: <NOARP> mtu 1480 qdisc noop state DOWN group default qlen 1000
    link/sit 0.0.0.0 brd 0.0.0.0
5: eth0@if8: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default
    link/ether 3e:da:18:4a:29:61 brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 10.244.0.2/24 brd 10.244.0.255 scope global eth0
       valid_lft forever preferred_lft forever
    inet6 fe80::3cda:18ff:fe4a:2961/64 scope link
       valid_lft forever preferred_lft forever
root@foo:/#

marcus@muerto:~/go/src/github.com/networkop/meshnet-cni$ kubectl get services -A
NAMESPACE     NAME         TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)                  AGE
default       kubernetes   ClusterIP   10.96.0.1    <none>        443/TCP                  3m15s
kube-system   kube-dns     ClusterIP   10.96.0.10   <none>        53/UDP,53/TCP,9153/TCP   3m13s

this is on k8s 1.22.0 running from inside kind

for example:

marcus@muerto:~/go/src/github.com/google/kne/kne_cli$ kubectl get pods -A -o wide
NAMESPACE            NAME                                        READY   STATUS    RESTARTS   AGE    IP           NODE                NOMINATED NODE   READINESS GATES
3node-host           vm-1                                        1/1     Running   0          25s    10.244.0.3   kne-control-plane   <none>           <none>
3node-host           vm-2                                        1/1     Running   0          25s    10.244.0.2   kne-control-plane   <none>           <none>
3node-host           vm-3                                        1/1     Running   0          25s    10.244.0.4   kne-control-plane   <none>           <none>
kube-system          coredns-78fcd69978-2wz7h                    1/1     Running   0          85s    10.244.0.4   kne-control-plane   <none>           <none>
kube-system          coredns-78fcd69978-vsphh                    1/1     Running   0          85s    10.244.0.3   kne-control-plane   <none>           <none>
kube-system          etcd-kne-control-plane                      1/1     Running   0          99s    172.18.0.2   kne-control-plane   <none>           <none>
kube-system          kindnet-79qbm                               1/1     Running   0          86s    172.18.0.2   kne-control-plane   <none>           <none>
kube-system          kube-apiserver-kne-control-plane            1/1     Running   0          101s   172.18.0.2   kne-control-plane   <none>           <none>
kube-system          kube-controller-manager-kne-control-plane   1/1     Running   0          99s    172.18.0.2   kne-control-plane   <none>           <none>
kube-system          kube-proxy-rjtlx                            1/1     Running   0          86s    172.18.0.2   kne-control-plane   <none>           <none>
kube-system          kube-scheduler-kne-control-plane            1/1     Running   0          100s   172.18.0.2   kne-control-plane   <none>           <none>
local-path-storage   local-path-provisioner-85494db59d-497gj     1/1     Running   0          85s    10.244.0.2   kne-control-plane   <none>           <none>
meshnet              meshnet-kxng5                               1/1     Running   0          56s    172.18.0.2   kne-control-plane   <none>           <none>
metallb-system       controller-6cc57c4567-qwhh6                 1/1     Running   0          85s    10.244.0.5   kne-control-plane   <none>           <none>
metallb-system       speaker-cmjgr                               1/1     Running   0          79s    172.18.0.2   kne-control-plane   <none>           <none>

the Pods that are started after meshnet is deployed appear to get IPs reassigned from the cluster space (note the duplicates with the coredns and local-path-provisioner Pods above)

@mhines01 @Cerebus what CNI plugins are you using? Can you compare the config from /etc/cni/net.d/ before and after meshnet installation?
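Something like this run against one of the kind nodes should capture it (substitute your node's container name):

docker exec <kind-node> sh -c 'for f in /etc/cni/net.d/*; do echo "== $f"; cat "$f"; done'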

I think I managed to reproduce the problem and found the issue:

 ls /run/cni-ipam-state/
kindnet/      masterplugin/

Looks like meshnet renames the plugin, which screws up the host-local IPAM cache.
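host-local keeps a separate lease database per network name: one directory per name under dataDir, one file per leased IP plus a last_reserved_ip counter. With two names in play, both start allocating from the bottom of 10.244.0.0/24, hence the duplicate Pod IPs. Roughly (illustrative listing, not captured from this cluster):

ls /run/cni-ipam-state/kindnet/
10.244.0.2  10.244.0.3  10.244.0.4  last_reserved_ip.0  lock
ls /run/cni-ipam-state/masterplugin/
10.244.0.2  10.244.0.3  last_reserved_ip.0  lock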

this part has always troubled me. there's a bunch of jq-foo and lots of room for errors. need to check how others do it, e.g. multus

So my dev environment is kindnet (from inside kind) <- this is the main one where I'm seeing this issue.

I realized why I only just noticed this:
in kind there are VERY few pods that get started before meshnet gets deployed. Specifically, only two workloads get set up before it, so these get .2 / .3 normally in the cluster, and after that everything works as expected.

Also, normally we deploy a network topology right after this, and since those pods never talk to the API server, I never noticed this.

The issue came up because we have vendors providing controllers for managing their own network pods in KNE. Those controllers do need to talk to the API server, and since they are now deployed right after meshnet but before the topology is pushed, that is where the error shows up.

Also, say you kill those two pods and restart them: they will get the next IPs, so again everything works.
So as a horrible workaround I can just fire up some pods as no-ops so allocation gets past the "duplicate" IP assignments. I tested that out today and it works, but it's awful :)
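Something along these lines, just to burn through the already-leased addresses (hypothetical pod names; any tiny image will do):

for i in 1 2 3; do kubectl run noop-$i --image=k8s.gcr.io/pause:3.5; done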

marcus@muerto:~/go/src/github.com/google/kne/kne_cli$ kubectl get pod -A -o wide
NAMESPACE            NAME                                        READY   STATUS    RESTARTS   AGE   IP           NODE                NOMINATED NODE   READINESS GATES
2node-host           vm-1                                        1/1     Running   0          8h    10.244.0.7   kne-control-plane   <none>           <none>
2node-host           vm-2                                        1/1     Running   0          8h    10.244.0.8   kne-control-plane   <none>           <none>
default              foo                                         1/1     Running   0          55s   10.244.0.9   kne-control-plane   <none>           <none>
kube-system          coredns-78fcd69978-2wz7h                    1/1     Running   0          13h   10.244.0.4   kne-control-plane   <none>           <none>
kube-system          coredns-78fcd69978-vsphh                    1/1     Running   0          13h   10.244.0.3   kne-control-plane   <none>           <none>
kube-system          etcd-kne-control-plane                      1/1     Running   0          13h   172.18.0.2   kne-control-plane   <none>           <none>
kube-system          kindnet-79qbm                               1/1     Running   0          13h   172.18.0.2   kne-control-plane   <none>           <none>
kube-system          kube-apiserver-kne-control-plane            1/1     Running   0          13h   172.18.0.2   kne-control-plane   <none>           <none>
kube-system          kube-controller-manager-kne-control-plane   1/1     Running   0          13h   172.18.0.2   kne-control-plane   <none>           <none>
kube-system          kube-proxy-rjtlx                            1/1     Running   0          13h   172.18.0.2   kne-control-plane   <none>           <none>
kube-system          kube-scheduler-kne-control-plane            1/1     Running   0          13h   172.18.0.2   kne-control-plane   <none>           <none>
local-path-storage   local-path-provisioner-85494db59d-497gj     1/1     Running   0          13h   10.244.0.2   kne-control-plane   <none>           <none>
meshnet              meshnet-kxng5                               1/1     Running   0          13h   172.18.0.2   kne-control-plane   <none>           <none>
metallb-system       controller-6cc57c4567-qwhh6                 1/1     Running   0          13h   10.244.0.5   kne-control-plane   <none>           <none>
metallb-system       speaker-cmjgr                               1/1     Running   0          13h   172.18.0.2   kne-control-plane   <none>           <none>

The Forbidden error is expected, as this pod doesn't have a ClusterRoleBinding to allow it (the connection is what matters).

marcus@muerto:~/go/src/github.com/google/kne/kne_cli$ kubectl exec -it -n 2node-host vm-1 /bin/bash
kubectl exec [POD] [COMMAND] is DEPRECATED and will be removed in a future version. Use kubectl exec [POD] -- [COMMAND] instead.
Defaulted container "vm-1" out of: vm-1, init-vm-1 (init)
root@vm-1:/# kubectl get pods
Error from server (Forbidden): pods is forbidden: User "system:serviceaccount:2node-host:default" cannot list resource "pods" in API group "" in the namespace "2node-host"
marcus@muerto:~/go/src/github.com/google/kne/kne_cli$ kubectl exec -it foo /bin/bash
kubectl exec [POD] [COMMAND] is DEPRECATED and will be removed in a future version. Use kubectl exec [POD] -- [COMMAND] instead.
root@foo:/# kubectl get pods
NAME   READY   STATUS    RESTARTS   AGE
foo    1/1     Running   0          50m
root@foo:/# 

Prod environment is Calico. It's really not an issue here, as we have much longer-running instances, so the problem was never really "observed" in this environment.

Yeah, and Calico manages its own IPAM, so I suspect this only affects kind users. Another straightforward workaround is to kubectl delete --all pods --all-namespaces right after meshnet is installed; this should force all IPs to be re-allocated.
But the right solution is to update the entrypoint script. I'll try to work up some courage to approach it some time next week.

Looking at it now, I think it may be super simple.

root@kne-control-plane:/etc/cni/net.d# cat 10-kindnet.conflist 00-meshnet.conf

{
	"cniVersion": "0.3.1",
	"name": "kindnet",
	"plugins": [
	{
		"type": "ptp",
		"ipMasq": false,
		"ipam": {
			"type": "host-local",
			"dataDir": "/run/cni-ipam-state",
			"routes": [
				
				
				{ "dst": "0.0.0.0/0" }
			],
			"ranges": [
				
				
				[ { "subnet": "10.244.0.0/24" } ]
			]
		}
		,
		"mtu": 1500
		
	},
	{
		"type": "portmap",
		"capabilities": {
			"portMappings": true
		}
	}
	]
}
{
  "cniVersion": "0.2.0",
  "name": "meshnet_network",
  "type": "meshnet",
  "delegate": {
    "type": "ptp",
    "ipMasq": false,
    "ipam": {
      "type": "host-local",
      "dataDir": "/run/cni-ipam-state",
      "routes": [
        {
          "dst": "0.0.0.0/0"
        }
      ],
      "ranges": [
        [
          {
            "subnet": "10.244.0.0/24"
          }
        ]
      ]
    },
    "mtu": 1500,
    "name": "masterplugin"
  }
}

I think if we just set the name back to the original delegated plugin's name, it would then use the same IPAM DB.
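Something along these lines in the entrypoint (untested sketch; the real script discovers the existing config rather than hard-coding paths):

# reuse the original conflist's network name instead of "masterplugin"
ORIG_NAME=$(jq -r '.name' /etc/cni/net.d/10-kindnet.conflist)
jq --arg name "$ORIG_NAME" '.delegate.name = $name' /etc/cni/net.d/00-meshnet.conf > /tmp/00-meshnet.conf \
  && mv /tmp/00-meshnet.conf /etc/cni/net.d/00-meshnet.conf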

yep, that's the solution I had in mind. It should be a one-line change here:
https://github.com/networkop/meshnet-cni/blob/master/docker/entrypoint.sh#L26

I was also playing with a slightly different approach to CNI configuration handling here
https://github.com/networkop/meshnet-cni/tree/test-cni-chaining
I've moved all CNI install/uninstall into meshnetd and do the parsing and injection entirely in Go code. This version also refactors the meshnet CNI config to use chaining instead of delegation (I can't remember why I chose delegation in the first place).
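With chaining, meshnet would simply be appended as one more entry in the existing conflist's plugins array, so the network name (and therefore the host-local state directory) never changes. Roughly like this (illustrative, not the exact output of that branch):

{
  "cniVersion": "0.3.1",
  "name": "kindnet",
  "plugins": [
    {
      "type": "ptp",
      "ipMasq": false,
      "ipam": {
        "type": "host-local",
        "dataDir": "/run/cni-ipam-state",
        "routes": [ { "dst": "0.0.0.0/0" } ],
        "ranges": [ [ { "subnet": "10.244.0.0/24" } ] ]
      },
      "mtu": 1500
    },
    { "type": "portmap", "capabilities": { "portMappings": true } },
    { "type": "meshnet" }
  ]
}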
This code is not thoroughly tested and can only parse conflist CNI files but it's a first step. wdyt @mhines01 @Cerebus ?

Looks pretty good so far. I'll patch over to it and see if it works with kind for my case.

Hmm, still having a problem: a container which is not part of a meshnet topology still isn't initializing:

marcus@muerto:~/go/src/github.com/google/kne/kne_cli$ kubectl get pods -A -o wide
NAMESPACE            NAME                                        READY   STATUS              RESTARTS   AGE     IP           NODE                NOMINATED NODE   READINESS GATES
default              foo                                         0/1     ContainerCreating   0          42s     <none>       foo-control-plane   <none>           <none>
kube-system          coredns-558bd4d5db-d2bqj                    1/1     Running             0          4m20s   10.244.0.4   foo-control-plane   <none>           <none>
kube-system          coredns-558bd4d5db-hlrsh                    1/1     Running             0          4m20s   10.244.0.2   foo-control-plane   <none>           <none>
kube-system          etcd-foo-control-plane                      1/1     Running             0          4m31s   172.18.0.2   foo-control-plane   <none>           <none>
time="2021-09-12T23:48:40Z" level=info msg="Processing ADD POD in namespace default"
time="2021-09-12T23:48:40Z" level=info msg="Attempting to connect to local meshnet daemon"
time="2021-09-12T23:48:40Z" level=info msg="Retrieving local pod information from meshnet daemon"
time="2021-09-12T23:48:40Z" level=info msg="Pod foo:default was not a topology pod returning"
time="2021-09-12T23:48:40Z" level=info msg="meshnet cni call successful"
time="2021-09-12T23:48:40Z" level=info msg="Processing DEL request: foo"
time="2021-09-12T23:48:40Z" level=info msg="Retrieving pod's metadata from meshnet daemon"
time="2021-09-12T23:48:40Z" level=info msg="Pod default:foo is not topology returning"

Is there supposed to be a returned JSON from the CNI call even if it does nothing?

got it #29

meshnet pod

marcus@muerto:~/go/src/github.com/networkop/meshnet-cni$ kubectl exec -it -n 2node-host vm-1 /bin/bash
kubectl exec [POD] [COMMAND] is DEPRECATED and will be removed in a future version. Use kubectl exec [POD] -- [COMMAND] instead.
Defaulted container "vm-1" out of: vm-1, init-vm-1 (init)
root@vm-1:/# kubectl get pods -A
Error from server (Forbidden): pods is forbidden: User "system:serviceaccount:2node-host:default" cannot list resource "pods" in API group "" at the cluster scope
root@vm-1:/# exit
exit
command terminated with exit code 1

non-meshnet pod

marcus@muerto:~/go/src/github.com/networkop/meshnet-cni$ kubectl exec -it foo /bin/bash
kubectl exec [POD] [COMMAND] is DEPRECATED and will be removed in a future version. Use kubectl exec [POD] -- [COMMAND] instead.
root@foo:/# kubectl get pods
NAME   READY   STATUS    RESTARTS   AGE
foo    1/1     Running   0          6m39s
root@foo:/# 

great, thanks @mhines01, just merged your PR.
Since kubelet supports CNI spec 0.4.0 and chains were introduced in 0.3.0, I think it should be safe to replace the delegation design with chaining.
One last thing I'd like to add is support for non-conflist config files, which should be fairly simple, similar to what kubelet is doing here.