hashicorp/raft

Raft re-adding peer gives Failed to AppendEntries error

sabaferoz1993 opened this issue · 2 comments

Hi, I am using this library to implement a Raft cluster, based on the following example:

https://github.com/otoolep/hraftd

I am currently testing node failure and network partition cases. However, I get a "Failed to AppendEntries" error when trying to reconnect a node.
Let's say I have a three-node cluster: C1, C2 and C3. Initially all the nodes work fine and C1 is elected as the leader. When I shut down a single node (either the leader or a follower) and reconnect it, it rejoins successfully. However, when I shut down both follower nodes, i.e. C2 and C3, and try to reconnect them, I start getting the AppendEntries error, and I have to restart all the nodes and delete the storage to make the cluster work again.
Similarly, when I create a network partition using iptables rules and disconnect C2 from the other two nodes, C1 and C3 seem to keep working fine. However, when C2 is reconnected I get the same error again.

I also tried a five-node cluster with C1, C2, C3, C4 and C5, but I am facing the same issues there as well.
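
For context on the two-followers-down case: Raft needs a majority of voters to elect a leader or commit entries, so a three-node cluster with only one node up cannot make progress until at least one more node returns. The sketch below only illustrates that arithmetic; `quorum` and `canMakeProgress` are hypothetical helper names and are not part of hashicorp/raft.

```go
package main

import "fmt"

// quorum returns the smallest majority for a cluster with n voters.
// Hypothetical helper for illustration, not a hashicorp/raft API.
func quorum(n int) int {
	return n/2 + 1
}

// canMakeProgress reports whether a cluster of n voters still has a
// majority when only `up` of them are reachable.
func canMakeProgress(n, up int) bool {
	return up >= quorum(n)
}

func main() {
	// Three-node cluster with C2 and C3 down: only C1 is up.
	fmt.Println(quorum(3), canMakeProgress(3, 1)) // prints "2 false"
	// Five-node cluster with two nodes down: three are up.
	fmt.Println(quorum(5), canMakeProgress(5, 3)) // prints "3 true"
}
```

This does not explain the error after the nodes come back; it only shows why the cluster is expected to stall while two of three nodes are down.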

I have already tried the solution mentioned in the following issue, but it doesn't work for me:
#78

The exact error is pasted below:

[DEBUG] raft: Failed to contact cluster3 in 9.867687472s
2021-06-15T15:20:00.772+0300 [ERROR] raft: Failed to make RequestVote RPC to {Voter cluster3 127.0.1.1:31003}: read tcp 127.0.0.1:39376->127.0.1.1:31003: i/o timeout
2021-06-15T15:20:00.781+0300 [ERROR] raft: Failed to AppendEntries to {Voter cluster3 127.0.1.1:31003}: read tcp 127.0.0.1:39382->127.0.1.1:31003: i/o timeout
2021-06-15T15:20:00.899+0300 [ERROR] raft: Failed to heartbeat to 127.0.1.1:31003: read tcp 127.0.0.1:39388->127.0.1.1:31003: i/o timeout
2021-06-15T15:20:01.121+0300 [DEBUG] raft: Failed to contact cluster3 in 10.341185724s

Hi @sabaferoz1993 ,

Thanks for reporting this issue. I'm not entirely sure what went wrong, so I'm hoping you can expand a little on your procedure.

Can you give the steps that you followed, or maybe a script that reproduces the problem?

The i/o timeout that you cite seems especially questionable. Is it possible something went wrong when removing the iptables rule?

stale commented

Hey there, this issue has been automatically closed because there hasn't been any activity for a while. If you are still experiencing problems, or still have questions, feel free to open a new one :+1: