bakwc/PySyncObj

broken leader re-election after killing most of cluster nodes

Closed this issue · 21 comments

I was randomly killing nodes of a 3-node cluster to verify an issue with leader re-election, checking the status with

➜  wspace cat clusterstatus.sh 
syncobj_admin -conn 127.0.0.1:6000 -status
syncobj_admin -conn 127.0.0.1:6001 -status
syncobj_admin -conn 127.0.0.1:6002 -status

➜  wspace bash clusterstatus.sh | egrep 'leader:|self:'
leader: localhost:6000
self: localhost:6000
leader: localhost:6000
self: localhost:6001
leader: localhost:6000
self: localhost:6002
➜  wspace bash clusterstatus.sh | egrep 'self:|leader:'
leader: localhost:6000
self: localhost:6000
leader: localhost:6000
self: localhost:6001
leader: localhost:6000
self: localhost:6002
➜  wspace bash clusterstatus.sh | egrep 'self:|leader:'
leader: localhost:6001
self: localhost:6001
leader: localhost:6001
self: localhost:6002

Here the leader should have been set to 6001, but node 6000 reported a None value instead:

➜  wspace bash clusterstatus.sh | egrep 'self:|leader:'
leader: None
self: localhost:6000
leader: localhost:6001
self: localhost:6001
leader: localhost:6001
self: localhost:6002
➜  wspace bash clusterstatus.sh | egrep 'self:|leader:'
leader: localhost:6001
self: localhost:6000
leader: localhost:6001
self: localhost:6001
leader: localhost:6001
self: localhost:6002
➜  wspace bash clusterstatus.sh | egrep 'self:|leader:'
leader: localhost:6002
self: localhost:6000
leader: localhost:6002
self: localhost:6001
leader: localhost:6002
self: localhost:6002
➜  wspace bash clusterstatus.sh | egrep 'self:|leader:'
leader: localhost:6001
self: localhost:6000
leader: localhost:6001
self: localhost:6001

And I reached this very interesting state where node 6000 has 6001 as its leader, but that leader isn't even active:

➜  wspace bash clusterstatus.sh | egrep 'self:|leader:'
leader: localhost:6001
self: localhost:6000

It was fixed after launching 6002 again:

➜  wspace bash clusterstatus.sh | egrep 'self:|leader:'
leader: localhost:6000
self: localhost:6000
leader: localhost:6000
self: localhost:6002

bakwc commented

Currently, a node's leader field resets only when the node receives messages from another node. It's not a bug: Raft requires more than half of the cluster to be alive to elect a new leader, so in your scenario this is normal behaviour.
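
For reference, a minimal sketch of the kind of 3-node setup being tested here (the ports, class name and polling loop are my own illustration, not the reporter's actual code):

import sys
import time

from pysyncobj import SyncObj, replicated


class Counter(SyncObj):
    def __init__(self, self_addr, partner_addrs):
        super(Counter, self).__init__(self_addr, partner_addrs)
        self.__value = 0

    @replicated
    def incr(self):
        self.__value += 1
        return self.__value


if __name__ == '__main__':
    # start one of the three nodes, e.g.: python node.py 6000 6001 6002
    self_port, partner_ports = sys.argv[1], sys.argv[2:]
    counter = Counter('localhost:%s' % self_port,
                      ['localhost:%s' % p for p in partner_ports])
    while True:
        # _getLeader() is an internal helper; it stays None until a majority
        # of the cluster (2 of 3 here) is alive and has elected a leader,
        # which is why killing 2 of 3 nodes leaves "leader: None".
        print('leader: %s' % counter._getLeader())
        time.sleep(2.0)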

@bakwc I tested with 4 nodes, 3 of them live, and the re-election didn't happen.

I saw this behaviour too. In my case I was testing with 4 nodes, 3 were active, and I killed the leader.
There was no re-election.
I cannot reproduce this every time, though.

kristofs-MacBook-Pro:recordchain kristofdespiegeleer$ syncobj_admin -conn 127.0.0.1:6001 -status -pass 1233
commit_idx: 1966
enabled_code_version: 0
last_applied: 1966
leader: localhost:6000
leader_commit_idx: 1966
log_len: 972
match_idx_count: 3
match_idx_server_localhost:6000: 0
match_idx_server_localhost:6002: 0
match_idx_server_localhost:6003: 0
next_node_idx_count: 3
next_node_idx_server_localhost:6000: 2
next_node_idx_server_localhost:6002: 2
next_node_idx_server_localhost:6003: 2
partner_node_status_server_localhost:6000: 2
partner_node_status_server_localhost:6002: 2
partner_node_status_server_localhost:6003: 2
partner_nodes_count: 3
raft_term: 2
readonly_nodes_count: 0
revision: 1899fe752bde334787dbfa54bb51bbd9fcf2826c
self: localhost:6001
self_code_version: 0
state: 0
unknown_connections_count: 1
uptime: 88
version: 0.3.3

Now I kill the leader:

kristofs-MacBook-Pro:recordchain kristofdespiegeleer$ syncobj_admin -conn 127.0.0.1:6001 -status -pass 1233
commit_idx: 3306
enabled_code_version: 0
last_applied: 3306
leader: None
leader_commit_idx: 3306
log_len: 316
match_idx_count: 3
match_idx_server_localhost:6000: 0
match_idx_server_localhost:6002: 0
match_idx_server_localhost:6003: 0
next_node_idx_count: 3
next_node_idx_server_localhost:6000: 2
next_node_idx_server_localhost:6002: 2
next_node_idx_server_localhost:6003: 2
partner_node_status_server_localhost:6000: 0
partner_node_status_server_localhost:6002: 0
partner_node_status_server_localhost:6003: 0
partner_nodes_count: 3
raft_term: 3
readonly_nodes_count: 0
revision: 1899fe752bde334787dbfa54bb51bbd9fcf2826c
self: localhost:6001
self_code_version: 0
state: 1
unknown_connections_count: 1
uptime: 163
version: 0.3.3

I am running the servers in tmux;
the 3 non-leaders are still running, but all of them now get errors when setting data.

The code we use for testing:
https://github.com/rivine/recordchain/edit/master/JumpScale9RecordChain/servers/raft/README.md

Restarting the leader leaves everything in limbo.

bakwc commented

Thanks for the report! I'll try to reproduce it. How long does it take to get into this situation? Does it reproduce only when you have a password-protected cluster?

I'll try a non-password-protected cluster; I'll do it now.

Yes indeed, that seems to be the issue; without a password I cannot reproduce it.

bakwc commented

I tested with 4 nodes, 3 of them live, and the re-election didn't happen

@xmonader, did you use a password? You created a 4-node cluster, killed only one node (the leader), and no new leader was elected?

bakwc commented

Please try increasing the following config options; set them to:

raftMinTimeout = 1.0
raftMaxTimeout = 3.0
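
For example, a sketch of where those options go, assuming a password-protected cluster like the one above (the password, addresses and variable names are placeholders):

from pysyncobj import SyncObj, SyncObjConf

# Election timeouts are picked randomly between raftMinTimeout and
# raftMaxTimeout; raising them gives followers more time to hear from
# the leader before they start a new election.
conf = SyncObjConf(
    raftMinTimeout=1.0,
    raftMaxTimeout=3.0,
    password='1233',  # placeholder; omit for an unprotected cluster
)

syncObj = SyncObj('localhost:6001',
                  ['localhost:6000', 'localhost:6002', 'localhost:6003'],
                  conf=conf)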

Sorry, I was away all this time; I will try it.

bakwc commented

Do you use Python 2 or Python 3?

Any update on this?
I have this issue too.

bakwc commented

Could you please provide more details? What are your reproduction steps? What is the cluster size? How many nodes were alive?

My cluster size is 4.
First I created a cluster and dynamically added 3 other nodes. I monitored syncObj._SyncObj__connectedNodes and saw 4 nodes, with the first node as leader.
When I kill other nodes the cluster keeps working: the killed node stays in syncObj._SyncObj__otherNodes but is removed from syncObj._SyncObj__connectedNodes.
But for the leader it's different: if I kill the leader, different nodes show different contents in syncObj._SyncObj__connectedNodes and none of them becomes leader.
I just killed the leader.
I also used raftMinTimeout = 1.0 and raftMaxTimeout = 3.0.

Also, when I removed a node from the cluster, it remained in otherNodes on the non-leader nodes. That may also be related to this.
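
For reference, this is roughly how I watch the membership while killing nodes (a simplified sketch; the address and timeouts are placeholders, my real test script is longer):

import time

from pysyncobj import SyncObj, SyncObjConf

conf = SyncObjConf(dynamicMembershipChange=True,
                   raftMinTimeout=1.0,
                   raftMaxTimeout=3.0)
syncObj = SyncObj('localhost:6000', [], conf=conf)

while True:
    # __otherNodes and __connectedNodes are name-mangled internals of
    # SyncObj, hence the _SyncObj__ prefix used below.
    print('leader: %s' % syncObj._getLeader())
    print('others: %s' % syncObj._SyncObj__otherNodes)
    print('connected: %s' % syncObj._SyncObj__connectedNodes)
    time.sleep(2.0)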

bakwc commented

Thanks for the report, I'll check it. What script did you use for testing? Could you post it somewhere (pastebin.com)? Does it always reproduce, or only from time to time?
Your steps were:

  1. Start a single-node cluster
  2. Add 3 more nodes dynamically
  3. Kill the first node
  4. No new leader was elected

Right?

bakwc commented

Checked multiple times, can't reproduce. Could you please provide detailed step-by-step instructions for your actions?

Actually, I did what you explained.
I can share my code with you. I am writing a tool for dynamic cluster extension: I have 10 nodes ready to join and, for example, a target cluster size of 4, and the cluster should expand itself to reach 4. After that I will change the target to 8, then reduce it to 3, so I need to be able to add and remove nodes.
To do this the cluster has to stay available through different failures. https://pastebin.com/svvG3eHK is my (very dirty) test code. I use a port scanner to find other nodes. I have some problems adding and removing nodes too.

Also, if you want, I can share my screen on a call.

bakwc commented

When adding nodes you need to specify all current cluster nodes manually. Added #112 to implement auto-discovery.
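
For example, a sketch only (addresses are placeholders, dynamicMembershipChange must be enabled on every node, and the two SyncObj instances below would normally live in separate processes):

from pysyncobj import SyncObj, SyncObjConf

conf = SyncObjConf(dynamicMembershipChange=True)

# The joining node must be started with *all* current cluster nodes listed,
# not just the one it happened to discover:
new_node = SyncObj('localhost:6003',
                   ['localhost:6000', 'localhost:6001', 'localhost:6002'],
                   conf=conf)

# On a node that is already part of the cluster (here the one bound to
# localhost:6000), register the newcomer; the membership change is routed
# through the current leader and replicated like any other log entry:
member = SyncObj('localhost:6000',
                 ['localhost:6001', 'localhost:6002'],
                 conf=conf)
member.addNodeToCluster('localhost:6003',
                        callback=lambda res, err: print('add: %s %s' % (res, err)))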

Thank you,
it works for me.