wtsi-hgi/hgi-systems

Consul not recovering after members have left

Closed this issue · 3 comments

The remaining node in the cluster is unable to elect a leader because it cannot win votes from the members that have already left.

$ consul monitor
2017/06/12 15:21:08 [INFO] raft: Node at 172.27.84.45:8300 [Candidate] entering Candidate state in term 4185
2017/06/12 15:21:08 [ERR] raft: Failed to make RequestVote RPC to {Voter 172.27.84.47:8300 172.27.84.47:8300}: dial tcp 172.27.84.47:8300: getsockopt: connection refused
2017/06/12 15:21:11 [ERR] raft: Failed to make RequestVote RPC to {Voter 192.168.102.78:8300 192.168.102.78:8300}: dial tcp 192.168.102.78:8300: getsockopt: no route to host
2017/06/12 15:21:11 [ERR] raft: Failed to make RequestVote RPC to {Voter 172.27.84.41:8300 172.27.84.41:8300}: dial tcp 172.27.84.41:8300: getsockopt: no route to host
2017/06/12 15:21:11 [ERR] agent: coordinate update error: No cluster leader
2017/06/12 15:21:16 [WARN] raft: Election timeout reached, restarting election
$ consul members
Node                        Address              Status  Type    Build  Protocol  DC
172.27.81.80                192.168.102.60:8301  left    client  0.8.3  2         delta-hgi-tenant
consul-server-delta-hgi-01  172.27.84.45:8301    alive   server  0.7.5  2         delta-hgi-tenant
consul-server-delta-hgi-02  172.27.84.47:8301    left    server  0.7.5  2         delta-hgi-tenant
consul-server-delta-hgi-03  192.168.102.78:8301  left    server  0.8.3  2         delta-hgi-tenant

A three-server consul cluster cannot tolerate two server failures: https://www.consul.io/docs/internals/consensus.html#deployment_table

So this is the correct behaviour if two nodes are gone from a three-node cluster.
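The deployment table linked above follows from raft's quorum rule: a cluster of n servers needs floor(n/2) + 1 votes, so it tolerates n minus that many failures. A quick illustration:

```shell
# Raft quorum size is floor(n/2) + 1, so a 3-server cluster
# tolerates only 1 failure -- losing 2 of 3 servers loses quorum.
for n in 1 3 5; do
  q=$(( n / 2 + 1 ))
  echo "servers=$n quorum=$q tolerated_failures=$(( n - q ))"
done
```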

It should have continued working after losing a single node, but not now that you've lost two. At this point I think you need to tell the remaining node that it is the only node in existence (i.e. that it is now a single-server cluster).
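A minimal sketch of Consul's documented peers.json outage-recovery procedure for that last step, assuming the data directory is /var/consul and consul runs under systemd (adjust both to your deployment). For Consul 0.7.x (raft protocol 2), peers.json is a JSON array of "ip:port" strings:

```shell
# 1. Stop the surviving server so its raft state is quiescent.
sudo systemctl stop consul

# 2. Declare the surviving server the sole raft voter.
echo '["172.27.84.45:8300"]' | sudo tee /var/consul/raft/peers.json

# 3. Restart; the server should now elect itself leader.
sudo systemctl start consul
```

Consul reads and deletes peers.json on startup, so this only takes effect once per restart.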

My expectation would be that if only one of the three servers fails, it can be brought back online either with its existing data or after having been completely rebuilt from scratch. As long as it has the same IP, it should be able to re-join the other two nodes that are still running and return to a fully working state. We should not need to manually recover using peers.json unless we lose quorum (so more than one server out of three).
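That automatic re-join behaviour can be encouraged in the server configuration. A sketch of the relevant keys (file path and surrounding config are illustrative; retry_join and bootstrap_expect are standard Consul agent options, shown here with this cluster's three server IPs):

```json
{
  "server": true,
  "bootstrap_expect": 3,
  "retry_join": ["172.27.84.45", "172.27.84.47", "192.168.102.78"]
}
```

With retry_join set, a server rebuilt at the same IP keeps retrying the listed peers on start until it rejoins, with no manual `consul join` needed.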

The cluster was restored in this situation manually using the Duplicity backups.

The 3rd node could not be restored initially due to #34. Attempts to work around this problem resulted in the accidental upgrade of consul on the 2 remaining nodes from 0.7.5 -> 0.8.3. A breaking change in this "minor" update (https://www.consul.io/docs/upgrade-specific.html#version-8-acls-are-now-opt-out) took out the remainder of the cluster. Downgrading consul from 0.8.3 back to 0.7.5 stopped consul from functioning correctly, probably due to the on-disk data having been upgraded by consul 0.8.3.