hashicorp/raft

Cluster unable to elect a leader after restarting a follower and stopping the leader while the follower is down

thomacr opened this issue · 2 comments

In a three-node Consul cluster with server nodes 0, 1, and 2, if I run the following test, the cluster cannot elect a leader:

  • Stop one follower node. For example, if node 0 is the leader, stop node 1.
  • Allow enough time for the leader to tell the other follower (node 2) that node 1 has left the cluster.
  • Stop the leader, in this case, node 0.
  • Bring back node 1.

Now the cluster can never elect a leader, even though a majority of the original three servers are up: node 2's configuration contains only itself and the old leader, node 0, so it will neither send vote requests to node 1 nor accept vote requests from it, and node 0 is down.
I think this happens because only the leader can update the followers' configuration, which cannot happen while there is no leader.
This looks like an important bug to me, but I need someone to confirm that.
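
To make the arithmetic concrete, here is a small self-contained Go sketch of the state node 2 ends up in. This is not hashicorp/raft code; the type and helper names are made up for illustration. Node 2's latest configuration still needs a two-vote majority, but only one member of that configuration is reachable, and node 1 is not a member at all:

package main

import "fmt"

// server stands in for an entry in a Raft configuration
// (hashicorp/raft's real type is raft.Server; these names are invented).
type server struct {
    id    string
    alive bool
}

// quorum is the number of votes needed: a strict majority of the
// servers in the node's *latest* configuration.
func quorum(cfg []server) int {
    return len(cfg)/2 + 1
}

// reachableVoters counts configuration members that are still up.
func reachableVoters(cfg []server) int {
    n := 0
    for _, s := range cfg {
        if s.alive {
            n++
        }
    }
    return n
}

func main() {
    // After the leader (node 0) removed node 1, node 2's latest
    // configuration contains only node 0 and itself.
    cfg := []server{
        {id: "node0", alive: false}, // old leader, now stopped
        {id: "node2", alive: true},  // the surviving follower
    }
    // node 1 is back up, but it is not in this configuration, so node 2
    // rejects its RequestVote RPCs and never asks it for votes.
    fmt.Printf("votes needed: %d, reachable members: %d\n",
        quorum(cfg), reachableVoters(cfg)) // votes needed: 2, reachable members: 1
}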

I caused this behaviour using the latest Consul Docker image. Here are the commands that should reproduce the issue:

docker run \
    -d \
    -p 8500:8500 \
    -p 8600:8600/udp \
    --name=node0 \
    consul agent -server -ui -node=server-0 -bootstrap-expect=3 -client=0.0.0.0 \
    -retry-join=172.17.0.2 -retry-join=172.17.0.3 -retry-join=172.17.0.4

docker run \
    -d \
    --name=node1 \
    consul agent -server -ui -node=server-1 -bootstrap-expect=3 -client=0.0.0.0 \
    -retry-join=172.17.0.2 -retry-join=172.17.0.3 -retry-join=172.17.0.4

docker run \
    -d \
    --name=node2 \
    consul agent -server -ui -node=server-2 -bootstrap-expect=3 -client=0.0.0.0 \
    -retry-join=172.17.0.2 -retry-join=172.17.0.3 -retry-join=172.17.0.4

docker stop node1  # given that node0 is the leader

docker stop node0

docker start node1

After running this you should see messages similar to the following on node 2:

failed to make requestVote RPC: target="{Voter 7505e313-c898-de46-944f-921948a36bb8 172.17.0.3:8300}" error="dial tcp <nil>->172.17.0.3:8300: connect: no route to host" term=17
rejecting vote request since node is not in configuration: from=172.17.0.2:8300

I also opened a bug in Consul as that's what I used to reproduce the problem: hashicorp/consul#15940

I re-ran these steps with Consul version 1.14.3 and could not reproduce the problem reliably. With the following changes to the steps, it should be reliably reproducible:

  • Use docker kill instead of docker stop when stopping the leader node.
  • After stopping the first node, give the leader enough time to update the cluster's Raft configuration before killing the leader. You will know enough time has elapsed when you see this log on the leader:
    2023-02-06T11:30:42.806Z [INFO] agent.server.raft: updating configuration: command=RemoveServer server-id=25760f36-3d02-eda1-bccf-b7a05ee0d9c5 server-addr= servers="[{Suffrage:Voter ID:4ad7524b-7124-06cd-f40e-3d980ec4ff30 Address:172.17.0.3:8300} {Suffrage:Voter ID:8ace7581-ccd7-4e2d-ee9f-8c8018775b4f Address:172.17.0.4:8300}]"

banks commented

Hi, I don't think this is a bug - the scenario you described loses quorum, which per Raft's design requires manual recovery.
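
For applications that embed this library directly, "manual recovery" usually means rewriting the persisted configuration with raft.RecoverCluster before restarting the node (Consul exposes this through its documented peers.json recovery procedure, which is the right tool for the reproduction above). A rough sketch, assuming a raft-boltdb log store and using placeholder paths, IDs, and addresses; the no-op FSM below is only a stand-in for the application's real FSM:

package main

import (
    "io"
    "log"
    "net"
    "os"
    "time"

    "github.com/hashicorp/raft"
    raftboltdb "github.com/hashicorp/raft-boltdb"
)

// noopFSM stands in for the application's real FSM; pass your actual FSM
// so the recovery snapshot captures real state.
type noopFSM struct{}

func (noopFSM) Apply(*raft.Log) interface{}         { return nil }
func (noopFSM) Snapshot() (raft.FSMSnapshot, error) { return noopSnapshot{}, nil }
func (noopFSM) Restore(io.ReadCloser) error         { return nil }

type noopSnapshot struct{}

func (noopSnapshot) Persist(sink raft.SnapshotSink) error { return sink.Close() }
func (noopSnapshot) Release()                             {}

func main() {
    // Placeholder path and address: point these at the surviving node's
    // existing raft data directory and bind address.
    dataDir := "/var/lib/myapp/raft"
    bindAddr := "172.17.0.4:8300"

    conf := raft.DefaultConfig()
    conf.LocalID = raft.ServerID("node2")

    logs, err := raftboltdb.NewBoltStore(dataDir + "/raft.db")
    if err != nil {
        log.Fatal(err)
    }
    snaps, err := raft.NewFileSnapshotStore(dataDir, 2, os.Stderr)
    if err != nil {
        log.Fatal(err)
    }
    addr, err := net.ResolveTCPAddr("tcp", bindAddr)
    if err != nil {
        log.Fatal(err)
    }
    trans, err := raft.NewTCPTransport(bindAddr, addr, 3, 10*time.Second, os.Stderr)
    if err != nil {
        log.Fatal(err)
    }

    // The configuration we force: only the servers that are actually
    // alive and reachable.
    newConf := raft.Configuration{Servers: []raft.Server{
        {Suffrage: raft.Voter, ID: raft.ServerID("node2"), Address: raft.ServerAddress("172.17.0.4:8300")},
    }}

    if err := raft.RecoverCluster(conf, noopFSM{}, logs, logs, snaps, trans, newConf); err != nil {
        log.Fatal(err)
    }
}

After this runs, start the node normally; with a single-server configuration it can elect itself, and the other servers can then be re-added as voters.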

The issue is complicated by autopilot features in some HashiCorp products, but I commented on ways to control those in the Consul issue #15940.

Closing as I think this is working as expected, but let us know if there is something we overlooked here!

Thanks for reporting!