Add reconciliation for grpc-wires
alexmasi opened this issue · 7 comments
When a grpc-wire-enabled meshnet pod on a node restarts (due to OOM, error, etc.), the grpc-wire info (wire/handler maps) is not persisted or reconciled on restart.
meshnet-cni/daemon/grpcwire/grpcwire.go, line 143 at d3ae648
This leads to errors like the following:
SendToOnce (wire id - 77): Could not find local handle. err:interface 77 is not active
stemming from:
meshnet-cni/daemon/grpcwire/grpcwire.go, line 254 at d3ae648
To make the grpc-wire add-on more resilient, reconciliation should be added (likely using the topology CRD).
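The proposed reconciliation could look roughly like the following. This is a hypothetical Go sketch, not the actual meshnet-cni implementation: `LinkStatus`, `reconcileWires`, and the field names are illustrative stand-ins for whatever per-link state the topology CRD status would persist.

```go
package main

import "fmt"

// LinkStatus is a stand-in for per-link state persisted in the topology
// CRD status: the wire UID plus the local interface it maps to.
type LinkStatus struct {
	UID       int    // wire id, e.g. the 77 in the error above
	LocalIntf string // host-side interface name
}

// wire is the in-memory handle that the daemon loses on restart.
type wire struct {
	uid       int
	localIntf string
}

// reconcileWires rebuilds the daemon's wire map from persisted CRD status,
// so a restarted daemon can serve SendToOnce for already-created links
// instead of failing with "interface N is not active".
func reconcileWires(persisted []LinkStatus) map[int]*wire {
	wires := make(map[int]*wire, len(persisted))
	for _, ls := range persisted {
		wires[ls.UID] = &wire{uid: ls.UID, localIntf: ls.LocalIntf}
	}
	return wires
}

func main() {
	// Simulate what would be read back from the topology CRD on restart.
	persisted := []LinkStatus{
		{UID: 77, LocalIntf: "eth1"},
		{UID: 78, LocalIntf: "eth2"},
	}
	wires := reconcileWires(persisted)
	if w, ok := wires[77]; ok {
		fmt.Printf("wire %d restored on %s\n", w.uid, w.localIntf) // prints "wire 77 restored on eth1"
	}
}
```

The key design point is that the CRD, not daemon memory, becomes the source of truth for wire state, so a daemon restart is a rebuild rather than a loss.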
Testing the grpc-wire reconciliation using a 150-node topology on KNE across 2 workers. Hitting a few issues:
- a handful of the nodes do not get past the Init state, with the following message:
Warning FailedCreatePodSandBox 8m7s kubelet Failed to create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "18635462e778e6c2ea1f99f6addc15f8ec8e4290cd742b47b8eaa69d782a7db7" network for pod "r96": networkPlugin cni failed to set up pod "r96_ceos-150" network: plugin type="meshnet" name="meshnet" failed (add): rpc error: code = Unknown desc = Topology.networkop.co.uk "r96" is invalid: status.skipped[0]: Invalid value: "object": status.skipped[0] in body must be of type string: "object"
- seeing several warnings like the following:
I0530 23:15:54.997458 20398 topo.go:403] Creating topology for meshnet node r5
W0530 23:15:55.161037 20398 warnings.go:70] unknown field "status.container_id"
- I am not seeing the CRD:
$ kubectl get gwirekobjs -A
error: the server doesn't have a resource type "gwirekobjs"
@alexmasi
It looks like old CRD YAMLs have been used to deploy meshnet with a newer meshnet binary, so the CRD definition in K8s and the CRD the binary supports are out of sync. Please take the latest YAML (manifest folder) from the master branch for deployment and let us know if it solves the issue.
Thanks Kingshuk, that's a mistake on my end. The grpc-wire reconciliation appears to be working now. When I delete a meshnet pod, it reloads with full information about the already-created links. I appreciate the implementation!
However, there is a separate issue with meshnet reconciliation in general. I tried deleting/recreating a pod during topology creation, and it mostly works, except parts of some topologies (in this case, the init containers for several of the router pods) get stuck waiting:
$ kubectl get pods -A -o wide | grep Init
ceos-150 r111 0/1 Init:0/1 0 37m 10.244.2.249 alexmasi-worker-2 <none> <none>
ceos-150 r112 0/1 Init:0/1 0 42m 10.244.1.217 alexmasi-worker-1 <none> <none>
ceos-150 r113 0/1 Init:0/1 0 41m 10.244.1.233 alexmasi-worker-1 <none> <none>
ceos-150 r124 0/1 Init:0/1 0 43m 10.244.1.213 alexmasi-worker-1 <none> <none>
ceos-150 r125 0/1 Init:0/1 0 42m 10.244.1.218 alexmasi-worker-1 <none> <none>
ceos-150 r126 0/1 Init:0/1 0 41m 10.244.1.229 alexmasi-worker-1 <none> <none>
ceos-150 r5 0/1 Init:0/1 0 44m 10.244.1.205 alexmasi-worker-1 <none> <none>
ceos-150 r6 0/1 Init:0/1 0 42m 10.244.1.216 alexmasi-worker-1 <none> <none>
ceos-150 r7 0/1 Init:0/1 0 41m 10.244.1.235 alexmasi-worker-1 <none> <none>
$ kubectl logs r5 -n ceos-150 init-r5 | tail -1
Connected 2 interfaces out of 3
$ kubectl logs r6 -n ceos-150 init-r6 | tail -1
Connected 1 interfaces out of 3
$ kubectl logs r7 -n ceos-150 init-r7 | tail -1
Connected 2 interfaces out of 3
Note that all but one of these cases happened on the worker node where the meshnet pod was deleted mid topology creation. Did you come across this issue in your testing @kingshukdev ?
@alexmasi glad to know that recon worked.
I can think of a few tricky situations if the meshnet daemon is restarted during topology creation. It is very time sensitive: the meshnet daemon is not available while K8s is trying to create the next pod. If the meshnet daemon comes back up before K8s retries, then it will go through.
How are you restarting the meshnet daemon - is it `kill -9 <pid>`? Once we know how you are restarting it, we can try playing with that.
kubectl delete pod meshnet-****** -n meshnet
then K8s will automatically bring up a new pod to match the intent
There seems to be a bug in #80. In my single-node cluster, I get:
time="2024-04-05T11:57:18-05:00" level=error msg="failed to run meshnet cni: <nil>"
time="2024-04-05T11:57:58-05:00" level=error msg="Add[c]: Failed to set a skipped flag on peer a"
This happens for all pods after the first two or three.
> k get pods
NAME READY STATUS RESTARTS AGE
a 0/2 Init:0/1 0 24s
b 0/2 Init:0/1 0 24s
c 0/2 Init:0/1 0 24s
d 0/2 Init:0/1 0 24s
aa 2/2 Running 0 23s
dd 2/2 Running 0 23s
(Pods stick in Init because I have an initContainer that waits for all the interfaces to be added. Since the CNI client is failing, this initContainer never exits.)
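An init container like the one described above can be approximated by the loop below. This is a hypothetical Go sketch, not the actual KNE/meshnet init binary; the expected interface count (`want`) would normally be passed in as an argument or environment variable.

```go
package main

import (
	"fmt"
	"net"
	"time"
)

// countNonLoopback returns the number of non-loopback interfaces visible
// in the pod's network namespace.
func countNonLoopback() (int, error) {
	ifaces, err := net.Interfaces()
	if err != nil {
		return 0, err
	}
	n := 0
	for _, i := range ifaces {
		if i.Flags&net.FlagLoopback == 0 {
			n++
		}
	}
	return n, nil
}

// waitForInterfaces polls until at least `want` interfaces are present.
// If the CNI plugin never wires the pod, this loops forever -- which is
// exactly the stuck Init:0/1 state seen in the issue.
func waitForInterfaces(want int, poll time.Duration) {
	for {
		got, err := countNonLoopback()
		if err == nil {
			fmt.Printf("Connected %d interfaces out of %d\n", got, want)
			if got >= want {
				return
			}
		}
		time.Sleep(poll)
	}
}

func main() {
	// Report the current count once; a real init container would call
	// waitForInterfaces with the topology's expected link count.
	n, err := countNonLoopback()
	if err != nil {
		panic(err)
	}
	fmt.Printf("found %d non-loopback interfaces\n", n)
}
```

Because the wait has no timeout, any CNI-side failure (such as the `Failed to set a skipped flag` error above) surfaces only as a pod stuck in Init rather than a crash.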
ETA: In this deployment, Pods a/b/c/d are linked in a "diamond" network (peers are a-b, b-d, a-c, c-d, and b-c), and Pods aa/dd are linked to only one peer each. So I speculate that this has something to do with Pods that have multiple peers. More testing needed.