hashicorp/serf

Improve force-leave

badrabubker opened this issue · 3 comments

Hey,
Thanks for providing Serf. I had an issue while running multiple instances of Serf.

Setup:
I have an InfiniBand interface, ib0, and two InfiniBand partitions with IPoIB; let's call them ib0.x and ib0.y.
I have two Serf agents running. Each binds to one interface and uses that interface's name (ib0.x or ib0.y) as the mDNS discover name.

The ib0.x cluster has more nodes than ib0.y; let's call two of the ib0.x nodes A and B.

It was working just fine until, for some reason I couldn't reproduce, A and B joined the ib0.y cluster and were shown as alive. Checking the logs, I could see that the Serf agent in ib0.y can't send gossip packets to A and B, since they are not actually on the same network (one is IPv4 and the other is IPv6), yet they were shown as alive.
Serf reachability, however, complains about missing acks from A and B, which is correct.

Steps I took to solve the issue:
I first tried stopping the Serf service on A and B. As a result, in ib0.x they were shown as left, but in ib0.y one of them was shown as failed and the other as alive.

So A and B were "stuck" in the ib0.y cluster and I couldn't remove them, even with force-leave; the alive node stayed in the leaving state for days.

The only workaround that worked was stopping the Serf service on all the nodes in ib0.y (200 nodes) and then starting them again.
Debugging that problem is outside the scope of this issue, since I can't reproduce it. But it would be really nice to have a force-leave flag that can remove (or forget) a node.

Sorry for that long issue and thanks a lot in advance

best

banks commented

I think we improved this in the library by adding RemoveFailedNodePrune recently. I'm not sure if we exposed that command in the serf binary though. It would be relatively easy to plumb in as an option, as we did for Consul, if anyone is willing to make a PR.

badrabubker commented

That's right, and there is already a flag for the force-leave command in the serf binary.
As I mentioned, two nodes, A and B, joined the other cluster.
The first cluster uses an IPv4 network and the second one uses IPv6.
After they joined they were stuck and shown as alive, not failed, so executing this command didn't help.
This is a snippet from the logs:

Jun  6 05:50:40 xxxxx serf[40807]:     2020/06/06 05:50:40 [ERR] memberlist: Failed to send ping: write udp [2a02:247f:301:4:2:0:a:b]:7946->192.168.250.1:7946: sendto: network is unreachable
--
Jun  6 05:50:43 xxxxx serf[40807]:     2020/06/06 05:50:43 [ERR] memberlist: Failed to send ping: write udp [2a02:247f:301:4:2:0:a:b]:7946->192.168.250.2:7946: sendto: network is unreachable
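
For anyone who wants to scan their own logs for the same symptom, here is a minimal sketch in Python. It only assumes the "Failed to send ping: write udp SRC->DST:" line shape shown in the snippet above; the function names are made up for illustration and are not part of Serf or memberlist.

```python
import ipaddress
import re

# Matches memberlist lines of the shape seen in the logs above:
#   ... Failed to send ping: write udp [v6addr]:7946->192.168.250.1:7946: sendto: ...
PING_RE = re.compile(r"Failed to send ping: write udp (\S+)->(\S+?): ")


def split_host(endpoint):
    """Drop the port (and IPv6 brackets) from 'host:port' or '[host]:port'."""
    host, _, _port = endpoint.rpartition(":")
    return host.strip("[]")


def family_mismatches(lines):
    """Yield (source, destination) pairs whose IP versions differ."""
    for line in lines:
        match = PING_RE.search(line)
        if not match:
            continue
        src = ipaddress.ip_address(split_host(match.group(1)))
        dst = ipaddress.ip_address(split_host(match.group(2)))
        if src.version != dst.version:
            yield str(src), str(dst)


if __name__ == "__main__":
    sample = [
        "2020/06/06 05:50:40 [ERR] memberlist: Failed to send ping: write udp "
        "[2a02:247f:301:4:2:0:a:b]:7946->192.168.250.1:7946: "
        "sendto: network is unreachable",
    ]
    for src, dst in family_mismatches(sample):
        print(f"IPv{ipaddress.ip_address(src).version} agent pinging "
              f"IPv{ipaddress.ip_address(dst).version} member {dst}")
```

Each destination reported this way is a member that an agent on one address family is trying to probe over the other one.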

banks commented

I see. Yeah it's certainly bad to accidentally join two Serf clusters and we have a number of recommended measures to prevent it including using different encryption keys for each.
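
As a rough illustration of the "different encryption keys" measure, the Python sketch below generates one independent base64-encoded key per cluster. It assumes the gossip key is a base64-encoded 16-, 24-, or 32-byte value passed to each agent (for example via the agent's -encrypt option); in practice `serf keygen` is the usual way to produce a key, and the helper name here is hypothetical.

```python
import base64
import secrets


def new_gossip_key(num_bytes=32):
    """Return a random base64-encoded key (assumed valid sizes: 16, 24, 32 bytes)."""
    if num_bytes not in (16, 24, 32):
        raise ValueError("key must be 16, 24, or 32 bytes")
    return base64.b64encode(secrets.token_bytes(num_bytes)).decode("ascii")


# One independent key per cluster, named after the interfaces in this issue.
keys = {cluster: new_gossip_key() for cluster in ("ib0.x", "ib0.y")}

if __name__ == "__main__":
    for cluster, key in keys.items():
        print(f"{cluster}: {key}")
```

With distinct keys, an agent from one cluster cannot decrypt gossip from the other, so an accidental join attempt should be rejected rather than merging the member lists.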

In this case, if the nodes were still seen as alive, that must mean that they are not failing any health probes - i.e. they could actually talk across the cluster. I'm not sure of any other way Serf would see them as alive rather than failed.

The best thing to do currently if you do manage to accidentally join two clusters is to either shut one down (if possible, i.e. it's a dev/testing cluster) or to firewall them to introduce a complete network partition. Then you can force-leave all of the other cluster's nodes from each cluster (with -prune to avoid waiting extra time).
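
The force-leave step could be scripted along these lines once the partition is in place. This is only a sketch: the node names A and B come from this issue, the helper names are hypothetical, and the optional -rpc-addr value is an assumption for hosts that run more than one agent. It defaults to a dry run that just prints the commands.

```python
import subprocess


def force_leave_commands(node_names, prune=True, rpc_addr=None):
    """Build one `serf force-leave` invocation per node name."""
    base = ["serf", "force-leave"]
    if prune:
        base.append("-prune")  # remove the member right away instead of waiting
    if rpc_addr:
        # Assumption: needed when several agents on one host use different RPC ports.
        base.append(f"-rpc-addr={rpc_addr}")
    return [base + [name] for name in node_names]


def run_all(commands, dry_run=True):
    """Print the commands, or execute them when dry_run is False."""
    for cmd in commands:
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Run against an agent in ib0.y to drop the stray ib0.x members.
    run_all(force_leave_commands(["A", "B"]))
```

The same script would be run a second time against an agent in the other cluster, with that cluster's list of foreign node names.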

I'm not sure how we could safely introduce a mechanism that allowed you to online split two merged clusters without this network separation - it would be much more complex than just another flag to force leave as you'd need to have all nodes in the cluster agree on which cluster they were a part of and delete all other nodes - if you just force a node to leave it would leave both clusters if they are joined etc.

Finally, if you actually did manage to see members stay "alive" even though they couldn't actually communicate, this could be a bug in mixed IPv4 and IPv6 environments. I'm not quite sure how they could ever be alive though, unless they are successfully exchanging at least TCP messages with peers - could it be that your v6 network is dual stack and so had no trouble sending data to the v4 IPs, but the v4 nodes couldn't send data back to v6 IPs? That might explain it.

Either way, what sort of UX would you want from an "improved force-leave" that would let you split the clusters? How did you imagine you'd map each node in the accidentally joined cluster back to a separate one without either shutting it down or causing it to be partitioned (i.e. failed) in the process?

Thanks for your input!