Add recovery option for quorum loss.
masnax opened this issue · 2 comments
Similar to LXD's `lxd cluster edit` command, which can be used to repair a cluster that has lost all but one of its nodes, we need a mechanism to modify the local cluster configuration if we have lost quorum.
@MggMuggins As a first step, we should investigate whether go-dqlite supports something similar to `lxd cluster edit` and `lxd recover-from-quorum-loss`.
`lxd recover-from-quorum-loss` currently calls the deprecated `Node.Recover` to reset the dqlite raft log with only the current node as a member of the cluster. Microcluster should use `ReconfigureMembership` instead.
`Node.Recover` and `ReconfigureMembership` invoke `dqlite_node_recover_ext`. The comment block there indicates that the function should be called exactly once, after which the entire data directory on every remaining dqlite member should be completely replaced by the data directory from the member where `dqlite_node_recover_ext` was run.
Unless I'm missing some dqlite/raft behavior or context, this isn't being done for `recover-from-quorum-loss`. That is not an issue for 3-node clusters, but anything larger would run into trouble (I have anecdotal evidence but haven't actually tried it). The docs for `lxd cluster edit` indicate that the same YAML should be applied to all cluster members, not just one. Since removing nodes isn't allowed via `edit`, my guess is that performing the edit on all nodes isn't problematic as long as each member has the same log before it is shut down, but I'm just guessing here.
In terms of what microcluster should do, my feeling is that we should expose the functionality dqlite provides for a reset that retains part of the cluster, something like `func (m *MicroCluster) RecoverFromQuorumLoss(keepMembers []string) error`, and read `cluster.yaml` for node IDs etc.
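To make the proposal a bit more concrete, here is a rough sketch of the member-selection step only. Everything in it is an assumption for illustration: `Member` is a simplified stand-in for whatever `cluster.yaml` actually holds, `selectMembers` and `recoverFromQuorumLoss` are hypothetical helpers, and `ReconfigureFunc` is a placeholder for the dqlite call so the sketch runs without dqlite installed. It is not the microcluster or go-dqlite API.

```go
package main

import (
	"errors"
	"fmt"
)

// Member is a hypothetical, simplified view of one entry in cluster.yaml.
type Member struct {
	ID      uint64
	Name    string
	Address string
}

// ReconfigureFunc is a placeholder for the dqlite call that rewrites the
// raft configuration stored in dir. It is injected so the sketch can run
// without a dqlite installation.
type ReconfigureFunc func(dir string, members []Member) error

// selectMembers returns the entries of all whose names appear in keep,
// preserving the order of all. It rejects an empty keep list, names that
// are listed twice, and names that are not in all.
func selectMembers(all []Member, keep []string) ([]Member, error) {
	if len(keep) == 0 {
		return nil, errors.New("at least one member must be kept")
	}
	wanted := make(map[string]bool, len(keep))
	for _, name := range keep {
		if wanted[name] {
			return nil, fmt.Errorf("member %q listed twice", name)
		}
		wanted[name] = true
	}
	kept := make([]Member, 0, len(keep))
	for _, m := range all {
		if wanted[m.Name] {
			kept = append(kept, m)
			delete(wanted, m.Name)
		}
	}
	// Anything still in wanted was requested but never found.
	for name := range wanted {
		return nil, fmt.Errorf("unknown member %q", name)
	}
	return kept, nil
}

// recoverFromQuorumLoss validates keep against the known members and then
// hands the surviving set to reconfigure exactly once.
func recoverFromQuorumLoss(dir string, all []Member, keep []string, reconfigure ReconfigureFunc) error {
	kept, err := selectMembers(all, keep)
	if err != nil {
		return err
	}
	return reconfigure(dir, kept)
}

func main() {
	// Example data; in practice this would be read from cluster.yaml.
	all := []Member{
		{ID: 1, Name: "node1", Address: "10.0.0.1:9000"},
		{ID: 2, Name: "node2", Address: "10.0.0.2:9000"},
		{ID: 3, Name: "node3", Address: "10.0.0.3:9000"},
	}
	// Fake reconfigure step that only reports what it was asked to do.
	report := func(dir string, members []Member) error {
		fmt.Printf("would reconfigure %s with %d member(s)\n", dir, len(members))
		return nil
	}
	if err := recoverFromQuorumLoss("/path/to/database", all, []string{"node1"}, report); err != nil {
		fmt.Println("recovery failed:", err)
	}
}
```

Validating the names before touching the database directory seems worthwhile, given that the underlying recovery call is only meant to run once.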
I'm thinking that copying the database dir isn't something we can do in microcluster/microcloud since the DB can't be running during the reset process; I'm happy to be corrected here. Will look more tomorrow.