Peer join race condition
lbradstreet opened this issue · 2 comments
lbradstreet commented
Found by Jepsen.
What I know about this issue:
- See the attached log. I have truncated it after 5000 lines. It all looks the same for the rest of the 250K lines :D
log_prepare_abort_join_5000.txt
- At line 1203, several peers start to prepare/abort perpetually. I believe this is due to the case where there is no diff in the replica: https://github.com/onyx-platform/onyx/blob/0.8.x/src/onyx/log/commands/prepare_join_cluster.clj#L45. The peer aborts, then tries to prepare again, over and over.
I initially thought this was due to the failure monitor, but no leave-cluster entries are sent out after a certain point. Since peers that see leave-cluster just shut themselves down and rejoin, I no longer think it's related. That said, I did see a couple of exceptions in the failure monitor, and peers may still be getting deadlocked as a result of these exceptions.
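To make the suspected prepare/abort loop concrete, here is a toy model in Python. It is not Onyx code: the replica shape, the pairing rule, and the names `prepare_join` and `join_loop` are all hypothetical, chosen only to show how a "no diff" result combined with an unconditional retry produces an endlessly repeating log.

```python
from typing import Optional


def prepare_join(replica: dict, joiner: str) -> Optional[dict]:
    """Return a diff for this join attempt, or None when nothing would change.

    Hypothetical rule: a joiner can only be paired with a peer that is not
    already paired with another pending joiner.
    """
    busy = set(replica["prepared"])
    candidates = [p for p in replica["peers"] if p not in busy]
    if not candidates:
        return None  # no diff: the join cannot make progress
    return {"joiner": joiner, "paired_with": candidates[0]}


def join_loop(replica: dict, joiner: str, max_attempts: int) -> list:
    """Retry the join until it succeeds or max_attempts is reached.

    Returns the log entries written along the way. Without the max_attempts
    cap, a replica that never changes would make this loop run forever.
    """
    log = []
    for _ in range(max_attempts):
        log.append(("prepare-join-cluster", joiner))
        diff = prepare_join(replica, joiner)
        if diff is not None:
            # Progress: record the pairing and stop retrying.
            replica["prepared"][diff["paired_with"]] = joiner
            return log
        # No diff: abort, then fall through to prepare again.
        log.append(("abort-join-cluster", joiner))
    return log


# The only peer is already paired, so every attempt yields no diff.
stuck = {"peers": ["a"], "prepared": {"a": "x"}}
stuck_log = join_loop(stuck, "j", max_attempts=3)
print(len(stuck_log))
```

With three attempts the log holds six entries, alternating prepare and abort, which matches the shape of the attached log. In this model the loop only ends if something else changes the replica between attempts, so a retry cap or a backoff would be one way to bound it. Whether either fits the real join protocol is an open question here.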
MichaelDrogalis commented
Tracking pending patch in #454.
lbradstreet commented
I believe this has been fixed by #484, but I would not be surprised to see it pop up again. I will continue the Jepsening, and close for now.