Improve force-leave
badrabubker opened this issue · 3 comments
Hey,
Thanks for providing Serf. I ran into an issue while running multiple instances of Serf.
Setup:
I have an InfiniBand interface ib0 and two InfiniBand partitions with IPoIB: let's call them ib0.x and ib0.y.
I have two serf agents running; each of them binds to one interface and uses the name of that interface (i.e. ib0.x or ib0.y) as the mDNS discovery name.
The ib0.x cluster has more nodes than ib0.y; let's call two of them A and B.
It was working just fine until, for some reason I couldn't reproduce, A and B joined the ib0.y cluster and were shown as alive. Checking the logs, I could see that the serf agent in ib0.y couldn't send gossip packets to A and B since they are not actually on the same network (one of them is IPv4 and the other is IPv6), yet they were still shown as alive.
Serf reachability, however, complains about missing acks from A and B, which is correct.
Steps I took to resolve the issue:
I first tried stopping the serf service on A and B. The result: in ib0.x they were shown as left, but in ib0.y one of them was shown as failed and the other as alive.
So A and B were "stuck" in the ib0.y cluster and I couldn't remove them even with force-leave; the alive node was still in the leaving state for days.
The only workaround that worked was stopping the serf service on all the nodes (200 nodes) in ib0.y and then starting them again.
The scope of this issue is not to debug that problem, since I can't reproduce it, but it would be really nice to have a force-leave flag that can remove (or forget) a node.
Sorry for the long issue, and thanks a lot in advance.
best
I think we improved this in the library recently by adding RemoveFailedNodePrune. I'm not sure if we exposed that command in the serf binary though. It would be relatively easy to plumb in as an option, as we did for Consul, if anyone is willing to make a PR.
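For anyone looking at the library side, a minimal sketch of the call might look like this (assuming the github.com/hashicorp/serf/serf package; the node name and configuration are placeholders, not from this issue):

```go
package main

import (
	"log"

	"github.com/hashicorp/serf/serf"
)

func main() {
	// Create an agent with default configuration (illustrative only;
	// a real deployment would set NodeName, bind address, etc.).
	s, err := serf.Create(serf.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}
	defer s.Shutdown()

	// RemoveFailedNode marks a failed node as left, but its entry
	// lingers until the tombstone expires. RemoveFailedNodePrune
	// additionally erases the member entry entirely, which is the
	// "forget" behavior requested in this issue.
	if err := s.RemoveFailedNodePrune("stuck-node"); err != nil {
		log.Printf("prune failed: %v", err)
	}
}
```

Note that both calls only act on nodes the cluster considers failed or left, which is why they don't help while the stuck node is still reported as alive.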
That's right, and there is already a flag for the force-leave command in the serf binary.
As I mentioned, two nodes A and B joined the other cluster. The first cluster uses an IPv4 network and the second one uses IPv6. After they joined, they were stuck and shown as alive, not failed, so executing this command didn't help.
This is a snippet from the logs:
Jun 6 05:50:40 xxxxx serf[40807]: 2020/06/06 05:50:40 [ERR] memberlist: Failed to send ping: write udp [2a02:247f:301:4:2:0:a:b]:7946->192.168.250.1:7946: sendto: network is unreachable
--
Jun 6 05:50:43 xxxxx serf[40807]: 2020/06/06 05:50:43 [ERR] memberlist: Failed to send ping: write udp [2a02:247f:301:4:2:0:a:b]:7946->192.168.250.2:7946: sendto: network is unreachable
I see. Yeah, it's certainly bad to accidentally join two Serf clusters, and we have a number of recommended measures to prevent it, including using different encryption keys for each cluster.
In this case, if the nodes were still seen as alive, that must mean they were not failing any health probes, i.e. they could actually talk across the cluster. I'm not aware of any other way Serf would see them as alive rather than failed.
The best thing to do currently, if you do manage to accidentally join two clusters, is either to shut one down (if possible, i.e. it's a dev/testing cluster) or to firewall them to introduce a complete network partition. Then you can force-leave all of the other cluster's nodes from each cluster (with -prune to avoid waiting extra time).
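That recovery procedure might look roughly like this (a sketch only; the interface, subnet, port, and node names are placeholders, and -prune requires a serf version that supports it):

```shell
# 1. Introduce a complete partition between the two merged clusters,
#    e.g. by dropping gossip traffic arriving from the other cluster's
#    subnet on this side's interface.
iptables -A INPUT -i ib0.y -s 192.168.250.0/24 -p udp --dport 7946 -j DROP
iptables -A INPUT -i ib0.y -s 192.168.250.0/24 -p tcp --dport 7946 -j DROP

# 2. Once the foreign nodes are marked failed, force them out and
#    prune their member entries so they don't linger in "leaving".
for node in A B; do
  serf force-leave -prune "$node"
done
```

The key point is step 1: force-leave only takes effect on nodes the cluster already sees as failed, so without the partition the stuck nodes stay "alive" and the command has nothing to act on.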
I'm not sure how we could safely introduce a mechanism that allowed you to split two merged clusters online without this network separation. It would be much more complex than just another flag on force-leave: you'd need all nodes in the cluster to agree on which cluster they were part of and delete all other nodes. If you just force a node to leave, it would leave both clusters, since they are joined.
Finally, if you actually did manage to see members stay "alive" even though they couldn't communicate, this could be a bug in mixed IPv4 and IPv6 environments. I'm not quite sure how they could ever be alive unless they were successfully exchanging at least TCP messages with peers. Could it be that your IPv6 network is dual-stack and so had no trouble sending data to the IPv4 addresses, while the IPv4 nodes couldn't send data back to the IPv6 addresses? That might explain it.
Either way, what sort of UX would you want from an "improved force-leave" that would let you split the clusters? How did you imagine you'd map each node in the accidentally joined cluster back to a separate one without either shutting it down or causing it to be partitioned (i.e. failed) in the process?
Thanks for your input!