hashicorp/serf

Can serf recover a single cluster following a "long" network partition?

davidMcneil opened this issue · 2 comments

If there is a network partition for a long period of time can serf automatically recover? To define some terms:

  • long - longer than the suspicion timeouts (i.e., all nodes across the partition are confirmed dead)
  • recover - combine the resulting two clusters into a single cluster

For example, there are four nodes A, B, C, and D. A network partition causes A and B to be isolated into their own cluster, and C and D are isolated into a separate cluster. Now the network is fixed. Will the two split clusters be able to recover and form a single cluster with all four nodes?

One could manually heal the split by adding a node with peers from both clusters.
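The manual heal described above can be sketched as a toy model. This is not Serf's API: the `join` function, the set-based clusters, and the bridging node `E` are hypothetical names used only to illustrate why a node that joins with peers from both sides ends up merging the two member lists.

```python
def join(clusters, new_node, peers):
    """Toy model: merge every cluster containing one of `peers`, then add `new_node`.

    `clusters` is a list of sets of node names. This only mimics the effect
    of a join on membership; it does not model gossip or failure detection.
    """
    peer_set = set(peers)
    merged = {new_node}
    remaining = []
    for cluster in clusters:
        if cluster & peer_set:
            # The new node learns this cluster's members through its peer.
            merged |= cluster
        else:
            remaining.append(cluster)
    remaining.append(merged)
    return remaining


# After the partition: {A, B} and {C, D} are separate clusters.
clusters = [{"A", "B"}, {"C", "D"}]

# Hypothetical node E joins using one peer from each side.
clusters = join(clusters, "E", peers=["A", "C"])
print([sorted(c) for c in clusters])
```

In this model, joining with a peer from only one side would leave the other cluster untouched, which is why the bridging node needs peers from both.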

Thanks for the awesome project!

"Suspicion" timeouts turn into "Dead" events, which the node in question could normally refute by declaring itself "Alive" again.
A network partition makes this impossible, so the node is marked as dead.

After a Serf node is marked as dead, the member is only kept around until the member list is cleaned up, at which point dead nodes are removed. After that, a manual rejoin is required to reconnect the "Cluster".
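As a rough illustration of that timeline, here is a minimal sketch that assumes a dead member is retained for a fixed window before cleanup. The `FailedMember` type, the `can_auto_reconnect` helper, and the 24-hour retention value are all hypothetical and are not taken from Serf's code or defaults.

```python
from dataclasses import dataclass


@dataclass
class FailedMember:
    name: str
    failed_at: float  # seconds; when the node was marked dead


def can_auto_reconnect(member, healed_at, retention):
    """True if the partition healed while the dead member was still retained.

    Once `retention` seconds have passed, this model treats the member as
    cleaned up, so only a manual rejoin could bring it back.
    """
    return healed_at - member.failed_at < retention


HOUR = 3600
RETENTION = 24 * HOUR  # hypothetical retention window, not a Serf default

c = FailedMember("C", failed_at=0.0)
print(can_auto_reconnect(c, healed_at=2 * HOUR, retention=RETENTION))   # True
print(can_auto_reconnect(c, healed_at=30 * HOUR, retention=RETENTION))  # False
```

Under this assumption, raising the retention window is what keeps dead members around long enough to outlast a long partition.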

https://www.serf.io/docs/internals/gossip.html describes the failure handling, but a look into the source can also help.

As always, most of the settings, especially the timeouts around member lists, can be set to very high values to keep all nodes around for longer.

Any nodes that a member knows about, and that are not already connected to the rest, will also be added to the member list.

  • With snapshots: sure, the recovery timeout just needs to be set to a high value
  • Without snapshots: as the member is removed from the lists, only a manual rejoin can bring it and its peers back into the "Cluster"