Cluster data plane fails after initial deploy
Cerebus opened this issue · 18 comments
Conditions:
- New kind cluster with kindnet
- meshnet-cni @v0.3.0 installed
Intermittently, Pods deployed immediately after meshnet come up with the cluster network unavailable. E.g., kube-prometheus-stack initializes with a Job, but it fails to talk to the API server:
> kubectl -n mimesis-data logs mimesis-mon-mimesis-data-admission-create-cslj5
W0829 17:07:15.999396 1 client_config.go:608] Neither --kubeconfig nor --master was specified. Using the inClusterConfig. This might not work.
{"err":"Get \"https://10.96.0.1:443/api/v1/namespaces/mimesis-data/secrets/mimesis-mon-mimesis-data-admission\": dial tcp 10.96.0.1:443: connect: no route to host","level":"fatal","msg":"error getting secret","source":"k8s/k8s.go:109","time":"2021-08-29T17:07:19Z"}
When this condition occurs, it happens with all Pods. I can exec into a Pod and try to ping cluster-cidr addresses and all return no route to host.
I can sometimes kick networking back into action by generating some external network traffic (e.g., running apt-get update from the kindnet pod).
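For the record, this is roughly how I kick it (assuming the stock kind daemonset name; any outbound traffic seems to do):
# rough sketch; "daemonset/kindnet" is the default name in a kind cluster, adjust if yours differs
kubectl -n kube-system exec daemonset/kindnet -- apt-get update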
Thanks for reporting this @Cerebus. Seems like the eth0 interface may not have been set up correctly.
When this does happen, is the issue persistent, or does it only affect Pods that were deployed immediately after meshnet?
Can you document the steps to reproduce this?
And if you happen to catch this again, can you collect the output of ip addr && ip route inside a Pod, plus journalctl logs from one of the kind nodes (something like docker exec <node> journalctl -xn --no-pager)?
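Something along these lines (pod and node names are placeholders):
# collect state from inside an affected Pod and from the kind node it runs on
kubectl -n <namespace> exec <pod> -- sh -c 'ip addr && ip route'
docker ps --format '{{.Names}}'                   # list the kind node containers
docker exec <kind-node> journalctl -xn --no-pager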
Reproduce:
- 3 pods in a triangle running alpine:latest
- All links are /31 addresses allocated from 192.168.0.0/16 (rough Topology sketch below)
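Roughly, the per-pod Topology CRs look like this (a minimal sketch; n2 and n3 mirror n1, and the addresses are just the ones visible in the output below):
# minimal sketch of one of the three Topology CRs (schema per meshnet's CRD)
cat <<EOF | kubectl -n example apply -f -
apiVersion: networkop.co.uk/v1beta1
kind: Topology
metadata:
  name: n1
spec:
  links:
    - uid: 1
      peer_pod: n2
      local_intf: eth1
      local_ip: 192.168.0.0/31
      peer_intf: eth1
      peer_ip: 192.168.0.1/31
    - uid: 2
      peer_pod: n3
      local_intf: eth2
      local_ip: 192.168.0.2/31
      peer_intf: eth2
      peer_ip: 192.168.0.3/31
EOF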
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
3: eth0@if5: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default
link/ether 1a:94:36:86:62:81 brd ff:ff:ff:ff:ff:ff link-netnsid 0
inet 10.244.0.2/24 brd 10.244.0.255 scope global eth0
valid_lft forever preferred_lft forever
inet6 fe80::1894:36ff:fe86:6281/64 scope link
valid_lft forever preferred_lft forever
8: eth1@if9: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default
link/ether 22:09:dd:c0:5b:27 brd ff:ff:ff:ff:ff:ff link-netnsid 1
inet 192.168.0.0/31 scope global eth1
valid_lft forever preferred_lft forever
inet6 fe80::2009:ddff:fec0:5b27/64 scope link
valid_lft forever preferred_lft forever
10: eth2@if11: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default
link/ether 7e:e2:dc:25:db:6e brd ff:ff:ff:ff:ff:ff link-netnsid 2
inet 192.168.0.2/31 scope global eth2
valid_lft forever preferred_lft forever
inet6 fe80::7ce2:dcff:fe25:db6e/64 scope link
valid_lft forever preferred_lft forever
10.244.0.0/24 via 10.244.0.1 dev eth0 src 10.244.0.2
10.244.0.1 dev eth0 scope link src 10.244.0.2
192.168.0.0/31 dev eth1 proto kernel scope link src 192.168.0.0
192.168.0.2/31 dev eth2 proto kernel scope link src 192.168.0.2
192.168.0.4/31 proto ospf metric 20
nexthop via 192.168.0.1 dev eth1 weight 1
nexthop via 192.168.0.3 dev eth2 weight 1
[deleted pinging coredns b/c I realized that's blocked anyway]
kubectl -n example get pod -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
n1 1/2 Running 9 19m 10.244.0.2 mimesis-demo-control-plane <none> <none>
n2 1/2 Running 9 19m 10.244.0.3 mimesis-demo-control-plane <none> <none>
n3 1/2 Running 9 19m 10.244.0.4 mimesis-demo-control-plane <none> <none>
kubectl -n example exec n1 -- ping -c 1 -W 1 10.244.0.3
Defaulting container name to workload.
Use 'kubectl describe pod/n1 -n example' to see all of the containers in this pod.
PING 10.244.0.3 (10.244.0.3): 56 data bytes
--- 10.244.0.3 ping statistics ---
1 packets transmitted, 0 packets received, 100% packet loss
command terminated with exit code 1
Journalctl returns a lot, natch; is there something in particular you're looking for?
Something caused the sidecars to crash, so I redeployed and now meshnet is working:
mimesis > kubectl -n example exec n1 -- ping 10.244.0.1
Defaulting container name to workload.
Use 'kubectl describe pod/n1 -n example' to see all of the containers in this pod.
PING 10.244.0.1 (10.244.0.1): 56 data bytes
64 bytes from 10.244.0.1: seq=0 ttl=64 time=0.123 ms
64 bytes from 10.244.0.1: seq=1 ttl=64 time=0.216 ms
I am seeing the same issue as well
marcus@muerto:~/go/src/github.com/google/kne/kne_cli$ kubectl exec -it foo -- /bin/bash
root@foo:/# kubectl get pods
Unable to connect to the server: dial tcp 10.96.0.1:443: connect: no route to host
root@foo:/# ip route
default via 10.244.0.1 dev eth0
10.244.0.0/24 via 10.244.0.1 dev eth0 src 10.244.0.2
10.244.0.1 dev eth0 scope link src 10.244.0.2
root@foo:/# ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: tunl0@NONE: <NOARP> mtu 1480 qdisc noop state DOWN group default qlen 1000
link/ipip 0.0.0.0 brd 0.0.0.0
3: sit0@NONE: <NOARP> mtu 1480 qdisc noop state DOWN group default qlen 1000
link/sit 0.0.0.0 brd 0.0.0.0
5: eth0@if8: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default
link/ether 3e:da:18:4a:29:61 brd ff:ff:ff:ff:ff:ff link-netnsid 0
inet 10.244.0.2/24 brd 10.244.0.255 scope global eth0
valid_lft forever preferred_lft forever
inet6 fe80::3cda:18ff:fe4a:2961/64 scope link
valid_lft forever preferred_lft forever
root@foo:/#
marcus@muerto:~/go/src/github.com/networkop/meshnet-cni$ kubectl get services -A
NAMESPACE NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
default kubernetes ClusterIP 10.96.0.1 <none> 443/TCP 3m15s
kube-system kube-dns ClusterIP 10.96.0.10 <none> 53/UDP,53/TCP,9153/TCP 3m13s
this is on k8s 1.22.0 running from inside kind
for example:
marcus@muerto:~/go/src/github.com/google/kne/kne_cli$ kubectl get pods -A -o wide
NAMESPACE NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
3node-host vm-1 1/1 Running 0 25s 10.244.0.3 kne-control-plane <none> <none>
3node-host vm-2 1/1 Running 0 25s 10.244.0.2 kne-control-plane <none> <none>
3node-host vm-3 1/1 Running 0 25s 10.244.0.4 kne-control-plane <none> <none>
kube-system coredns-78fcd69978-2wz7h 1/1 Running 0 85s 10.244.0.4 kne-control-plane <none> <none>
kube-system coredns-78fcd69978-vsphh 1/1 Running 0 85s 10.244.0.3 kne-control-plane <none> <none>
kube-system etcd-kne-control-plane 1/1 Running 0 99s 172.18.0.2 kne-control-plane <none> <none>
kube-system kindnet-79qbm 1/1 Running 0 86s 172.18.0.2 kne-control-plane <none> <none>
kube-system kube-apiserver-kne-control-plane 1/1 Running 0 101s 172.18.0.2 kne-control-plane <none> <none>
kube-system kube-controller-manager-kne-control-plane 1/1 Running 0 99s 172.18.0.2 kne-control-plane <none> <none>
kube-system kube-proxy-rjtlx 1/1 Running 0 86s 172.18.0.2 kne-control-plane <none> <none>
kube-system kube-scheduler-kne-control-plane 1/1 Running 0 100s 172.18.0.2 kne-control-plane <none> <none>
local-path-storage local-path-provisioner-85494db59d-497gj 1/1 Running 0 85s 10.244.0.2 kne-control-plane <none> <none>
meshnet meshnet-kxng5 1/1 Running 0 56s 172.18.0.2 kne-control-plane <none> <none>
metallb-system controller-6cc57c4567-qwhh6 1/1 Running 0 85s 10.244.0.5 kne-control-plane <none> <none>
metallb-system speaker-cmjgr 1/1 Running 0 79s 172.18.0.2 kne-control-plane <none> <none>
The pods that are started after the meshnet Deployment appear to get IPs re-assigned from the cluster space; note above that vm-1/vm-3 share 10.244.0.3/10.244.0.4 with the coredns pods and vm-2 shares 10.244.0.2 with local-path-provisioner.
I think I managed to reproduce the problem and found the issue:
ls /run/cni-ipam-state/
kindnet/ masterplugin/
Looks like meshnet renames the plugin which screws up the host-local IPAM cache.
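That also explains the duplicate assignments, since host-local keys its allocation DB by network name; a sketch of what to look for on a kind node:
# host-local stores one directory per network name, one file per allocated IP;
# renaming the network to "masterplugin" starts a second, empty allocation DB,
# so the same addresses get handed out again (run on the kind node)
ls /run/cni-ipam-state/kindnet /run/cni-ipam-state/masterplugin
# both directories can end up with files for the same IPs, e.g. 10.244.0.2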
this part has always troubled me. there's a bunch of jq-foo and lots of room for errors. need to check how others do it, e.g. multus
So my dev environment is kindnet (from inside kind) <- this is the main one where I'm seeing this issue.
I realized why I only just noticed this: in kind there are VERY few pods that get started before meshnet is deployed (specifically, only 2 tasks get set up before it), so these get .2/.3 normally in the cluster, and after that everything works as expected.
Also, normally we deploy a network topology right after this, and since those pods never talk to the API server, I never noticed this.
The issue came up because we have vendors providing controllers for managing their own network pods in KNE. Those controllers do need to talk to the API server, and since they are now deployed right after meshnet but before the topology is pushed, that is where the error shows up.
Also, say you kill those two pods and restart them: they will get the next IPs, so again everything works.
So as a horrible workaround I can just fire up some pods as no-ops so we get past the "duplicate" IP assignments. I tested that out today and it works, but it's awful :)
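Roughly (image and count are arbitrary, just enough to burn the already-used low addresses):
# burn the low IPs in meshnet's fresh IPAM DB with throw-away pause pods right
# after meshnet is installed, so the pods that matter don't collide with the
# addresses kind handed out earlier
for i in 1 2 3; do kubectl run noop-$i --image=k8s.gcr.io/pause:3.5; done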
marcus@muerto:~/go/src/github.com/google/kne/kne_cli$ kubectl get pod -A -o wide
NAMESPACE NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
2node-host vm-1 1/1 Running 0 8h 10.244.0.7 kne-control-plane <none> <none>
2node-host vm-2 1/1 Running 0 8h 10.244.0.8 kne-control-plane <none> <none>
default foo 1/1 Running 0 55s 10.244.0.9 kne-control-plane <none> <none>
kube-system coredns-78fcd69978-2wz7h 1/1 Running 0 13h 10.244.0.4 kne-control-plane <none> <none>
kube-system coredns-78fcd69978-vsphh 1/1 Running 0 13h 10.244.0.3 kne-control-plane <none> <none>
kube-system etcd-kne-control-plane 1/1 Running 0 13h 172.18.0.2 kne-control-plane <none> <none>
kube-system kindnet-79qbm 1/1 Running 0 13h 172.18.0.2 kne-control-plane <none> <none>
kube-system kube-apiserver-kne-control-plane 1/1 Running 0 13h 172.18.0.2 kne-control-plane <none> <none>
kube-system kube-controller-manager-kne-control-plane 1/1 Running 0 13h 172.18.0.2 kne-control-plane <none> <none>
kube-system kube-proxy-rjtlx 1/1 Running 0 13h 172.18.0.2 kne-control-plane <none> <none>
kube-system kube-scheduler-kne-control-plane 1/1 Running 0 13h 172.18.0.2 kne-control-plane <none> <none>
local-path-storage local-path-provisioner-85494db59d-497gj 1/1 Running 0 13h 10.244.0.2 kne-control-plane <none> <none>
meshnet meshnet-kxng5 1/1 Running 0 13h 172.18.0.2 kne-control-plane <none> <none>
metallb-system controller-6cc57c4567-qwhh6 1/1 Running 0 13h 10.244.0.5 kne-control-plane <none> <none>
metallb-system speaker-cmjgr 1/1 Running 0 13h 172.18.0.2 kne-control-plane <none> <none>
The Forbidden is expected, as this pod doesn't have a ClusterRoleBinding to allow it (the connection succeeding is what matters).
marcus@muerto:~/go/src/github.com/google/kne/kne_cli$ kubectl exec -it -n 2node-host vm-1 /bin/bash
kubectl exec [POD] [COMMAND] is DEPRECATED and will be removed in a future version. Use kubectl exec [POD] -- [COMMAND] instead.
Defaulted container "vm-1" out of: vm-1, init-vm-1 (init)
root@vm-1:/# kubectl get pods
Error from server (Forbidden): pods is forbidden: User "system:serviceaccount:2node-host:default" cannot list resource "pods" in API group "" in the namespace "2node-host"
marcus@muerto:~/go/src/github.com/google/kne/kne_cli$ kubectl exec -it foo /bin/bash
kubectl exec [POD] [COMMAND] is DEPRECATED and will be removed in a future version. Use kubectl exec [POD] -- [COMMAND] instead.
root@foo:/# kubectl get pods
NAME READY STATUS RESTARTS AGE
foo 1/1 Running 0 50m
root@foo:/#
The prod environment is Calico; it's really not an issue there, as we have much longer-running instances, so this was never really observed in that environment.
Yeah, and Calico manages its own IPAM; I suspect this only affects kind users. Another straightforward workaround is to kubectl delete --all pods --all-namespaces right after meshnet is installed. This should force all IPs to be re-allocated.
But the right solution is to update the entrypoint script. I'll try to work up some courage to approach it some time next week.
Looking at it now, I think it may be super simple:
root@kne-control-plane:/etc/cni/net.d# cat 10-kindnet.conflist 00-meshnet.conf
{
"cniVersion": "0.3.1",
"name": "kindnet",
"plugins": [
{
"type": "ptp",
"ipMasq": false,
"ipam": {
"type": "host-local",
"dataDir": "/run/cni-ipam-state",
"routes": [
{ "dst": "0.0.0.0/0" }
],
"ranges": [
[ { "subnet": "10.244.0.0/24" } ]
]
}
,
"mtu": 1500
},
{
"type": "portmap",
"capabilities": {
"portMappings": true
}
}
]
}
{
"cniVersion": "0.2.0",
"name": "meshnet_network",
"type": "meshnet",
"delegate": {
"type": "ptp",
"ipMasq": false,
"ipam": {
"type": "host-local",
"dataDir": "/run/cni-ipam-state",
"routes": [
{
"dst": "0.0.0.0/0"
}
],
"ranges": [
[
{
"subnet": "10.244.0.0/24"
}
]
]
},
"mtu": 1500,
"name": "masterplugin"
}
}
I think if we just set the name back to the original delegated plugin's name, it would then use the same IPAM db.
yep, that's the solution I had in mind. It should be a one-line change here:
https://github.com/networkop/meshnet-cni/blob/master/docker/entrypoint.sh#L26
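Something along these lines, I imagine (a rough sketch only; the file names are taken from the configs above and the jq is illustrative, not the actual entrypoint code):
# keep the delegate's original network name so host-local reuses the same IPAM DB
ORIG_CONF=/etc/cni/net.d/10-kindnet.conflist
ORIG_NAME=$(jq -r '.name' "$ORIG_CONF")        # "kindnet"
DELEGATE=$(jq '.plugins[0]' "$ORIG_CONF")      # the ptp/host-local plugin
jq -n --arg name "$ORIG_NAME" --argjson delegate "$DELEGATE" \
  '{cniVersion: "0.2.0", name: "meshnet_network", type: "meshnet",
    delegate: ($delegate + {name: $name})}' > /etc/cni/net.d/00-meshnet.conf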
I was also playing with a slightly different approach to CNI configuration handling here
https://github.com/networkop/meshnet-cni/tree/test-cni-chaining
I've moved all CNI install/uninstall logic to meshnetd and do the parsing and injection entirely in Go code. This version also includes refactoring the meshnet CNI config to use chaining instead of delegation (I can't remember why I chose delegation in the first place).
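For comparison, with chaining the node ends up with a single conflist and meshnet is simply appended to the existing plugins array; very roughly (pseudo-JSON, not the exact output of that branch):
{
  "cniVersion": "0.3.1",
  "name": "kindnet",
  "plugins": [
    { "type": "ptp", ...unchanged... },
    { "type": "portmap", ...unchanged... },
    { "type": "meshnet" }
  ]
}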
This code is not thoroughly tested and can only parse conflist CNI files, but it's a first step. wdyt @mhines01 @Cerebus ?
looks pretty good so far - will patch to it and see if it works with kind for my case
Hmm, still having a problem: a container which is not a meshnet pod still isn't initializing:
marcus@muerto:~/go/src/github.com/google/kne/kne_cli$ kubectl get pods -A -o wide
NAMESPACE NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
default foo 0/1 ContainerCreating 0 42s <none> foo-control-plane <none> <none>
kube-system coredns-558bd4d5db-d2bqj 1/1 Running 0 4m20s 10.244.0.4 foo-control-plane <none> <none>
kube-system coredns-558bd4d5db-hlrsh 1/1 Running 0 4m20s 10.244.0.2 foo-control-plane <none> <none>
kube-system etcd-foo-control-plane 1/1 Running 0 4m31s 172.18.0.2 foo-control-plane <none> <none>
time="2021-09-12T23:48:40Z" level=info msg="Processing ADD POD in namespace default"
time="2021-09-12T23:48:40Z" level=info msg="Attempting to connect to local meshnet daemon"
time="2021-09-12T23:48:40Z" level=info msg="Retrieving local pod information from meshnet daemon"
time="2021-09-12T23:48:40Z" level=info msg="Pod foo:default was not a topology pod returning"
time="2021-09-12T23:48:40Z" level=info msg="meshnet cni call successful"
time="2021-09-12T23:48:40Z" level=info msg="Processing DEL request: foo"
time="2021-09-12T23:48:40Z" level=info msg="Retrieving pod's metadata from meshnet daemon"
time="2021-09-12T23:48:40Z" level=info msg="Pod default:foo is not topology returning"
Is there supposed to be a returned JSON from the CNI call even if it does nothing?
got it #29
meshnet pod
marcus@muerto:~/go/src/github.com/networkop/meshnet-cni$ kubectl exec -it -n 2node-host vm-1 /bin/bash
kubectl exec [POD] [COMMAND] is DEPRECATED and will be removed in a future version. Use kubectl exec [POD] -- [COMMAND] instead.
Defaulted container "vm-1" out of: vm-1, init-vm-1 (init)
root@vm-1:/# kubectl get pods -A
Error from server (Forbidden): pods is forbidden: User "system:serviceaccount:2node-host:default" cannot list resource "pods" in API group "" at the cluster scope
root@vm-1:/# exit
exit
command terminated with exit code 1
non meshnet pod
marcus@muerto:~/go/src/github.com/networkop/meshnet-cni$ kubectl exec -it foo /bin/bash
kubectl exec [POD] [COMMAND] is DEPRECATED and will be removed in a future version. Use kubectl exec [POD] -- [COMMAND] instead.
root@foo:/# kubectl get pods
NAME READY STATUS RESTARTS AGE
foo 1/1 Running 0 6m39s
root@foo:/#
great, thanks @mhines01, just merged your PR.
Since kubelet supports CNI spec 0.4.0 and chains were introduced in 0.3.0, I think it should be safe to replace the delegation design with chaining.
One last thing I'd like to add is the support for non-conflist config files which should be fairly simple, similar to what kubelet is doing here.
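The non-conflist case is essentially just wrapping the single plugin into a one-element plugins list; a sketch of the conversion (file name hypothetical, jq only to illustrate the shape):
# hoist name/cniVersion to the top level and nest the plugin itself under "plugins"
jq '{cniVersion: .cniVersion, name: .name, plugins: [ del(.cniVersion, .name) ]}' \
  /etc/cni/net.d/10-example.conf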