hashicorp/raft

ErrLeadershipLost may leave the system unable to work

Closed this issue · 3 comments

I found that a leader can fail to apply a log entry because it loses leadership while committing the log (ErrLeadershipLost).
However, raft.Leader() still returns the original leader.
Moreover, no election takes place because all nodes think the old leader is still working. In effect there is no real leader, so the system cannot make progress unless I stop the old leader manually.
Looking forward to your reply.

I found this problem after setting the snapshot interval and threshold to very small values, such as 100ms and 1. I don't know whether the bug is caused by several problems together, but I did notice that once the leader hits ErrLeadershipLost, it never takes a snapshot again. Some entries were applied before that, and because the threshold is 1, it should take a snapshot if it were working normally.
Moreover, I noticed that raft.Shutdown().Error() may never return; I guess there may be a deadlock or something similar.
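As a possible workaround for the hanging shutdown, the blocking Error() call can be bounded with a timeout. The sketch below assumes only that the value returned by Shutdown() has an Error() error method; errorFuture, waitWithTimeout, errWaitTimeout, stuckFuture, doneFuture, and demoTimeout are hypothetical names introduced for this example, and stuckFuture merely simulates a future that never completes.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errorFuture captures the one method this sketch needs from the future
// returned by Shutdown() (assumption: it exposes Error() error).
type errorFuture interface {
	Error() error
}

// errWaitTimeout is returned when the future does not complete in time.
var errWaitTimeout = errors.New("timed out waiting for future")

// demoTimeout is a short timeout used only by the demonstration below.
const demoTimeout = 20 * time.Millisecond

// waitWithTimeout calls f.Error() in a goroutine and returns its result, or
// errWaitTimeout if it has not returned after d. If the future never
// completes, the goroutine stays blocked; the caller is merely unblocked.
func waitWithTimeout(f errorFuture, d time.Duration) error {
	done := make(chan error, 1) // buffered so a late result does not block
	go func() { done <- f.Error() }()

	select {
	case err := <-done:
		return err
	case <-time.After(d):
		return errWaitTimeout
	}
}

// stuckFuture simulates a future whose Error() never returns.
type stuckFuture struct{}

func (stuckFuture) Error() error { select {} }

// doneFuture simulates a future that completes immediately without error.
type doneFuture struct{}

func (doneFuture) Error() error { return nil }

func main() {
	fmt.Println(waitWithTimeout(doneFuture{}, demoTimeout))
	fmt.Println(waitWithTimeout(stuckFuture{}, demoTimeout))
}
```

In real use the argument would be the future returned by Shutdown(). A timeout here only tells the caller that shutdown is stuck so it can log the fact or exit the process; it does not release whatever is blocking inside the library.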
I hope these problems can be fixed soon, because restarting the node to work around them is not a good approach and may cause other problems.
Thanks!

Hello, and thank you so much for your question!

To ensure I'm understanding properly: the snapshot threshold is set to 1, which means a snapshot is taken every time there is a new log entry. When a snapshot is due, ErrLeadershipLost occurs and the snapshots hang, is that right? Is there any spike in resource usage (disk I/O, etc.) at this time?

Would you mind showing us any logs from the time this occurs? Also, if you have a repository or any reproduction steps you could provide, I'd be more than happy to check this out! :)

Given we haven't heard back on our suggestions/questions above, I'm going to close this issue, but I encourage you to comment and we can re-open it if you want to pick this up again.

Alternatively, if things have changed dramatically, feel free to create a new issue or PR.