Stress test the network reliability
Why
We have repeatedly seen situations in which the head cannot progress. These problems are hard to reproduce, and in each case we have spent a lot of time coordinating attempts to resolve them.
Records of these issues are here:
One possible solution was a manual snapshot recovery as outlined here:
This is unsatisfying as we would prefer to make the nodes self-healing and not require manual intervention.
What
- We should challenge the assumptions of the reliability layer. This is currently a combination of ouroboros-network and an implementation of Logged Uniform Reliable Broadcast found in https://fileadmin.cs.lth.se/cs/Personal/Amr_Ergawy/dist-algos-slides/fourth-presentation.pdf
- We also want to challenge the assumption that the on-disk persistence of the vector clock and outbound messages is actually needed. #1417
- We also want to provide a way for users with stuck heads to collect diagnostic information and submit it to the team for analysis.
How
- Create a test that stress tests the network layer with three or more intermittently failing peers. A failing peer is one that fails to send, receive, or persist network messages.
- (Optional) Extract the network layer into its own package to remove coupling.
- Create an issue template for reporting stuck heads, and optionally a mechanism that lets users attach diagnostic data to the report.
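
To make the first item more concrete, here is a rough sketch of the shape such a stress test could take. It is a toy simulation in Python, not the actual network layer: the function `stress_test_broadcast`, its parameters, and the simplified acknowledge-and-resend scheme are all assumptions made for illustration. Each round, every peer may independently fail to send or to receive. Persistence failures are not modelled here.

```python
import random


def stress_test_broadcast(num_peers=3, messages_per_peer=20,
                          failure_rate=0.5, max_rounds=10_000, seed=0):
    """Simulate peers that intermittently fail to send or receive.

    Hypothetical model: a sender resends the next message a receiver has
    not acknowledged yet. Returns the round in which every peer has
    delivered every other peer's messages in order, or None if that did
    not happen within max_rounds.
    """
    rng = random.Random(seed)  # seeded so a failing run can be replayed
    names = [f"peer-{i}" for i in range(num_peers)]
    # Messages each peer wants to broadcast, in order.
    outbox = {n: [f"{n}:msg-{k}" for k in range(messages_per_peer)]
              for n in names}
    # delivered[r][s]: messages from s that r has delivered so far.
    delivered = {r: {s: [] for s in names if s != r} for r in names}
    # known_ack[s][r]: how many of s's messages s believes r has delivered.
    known_ack = {s: {r: 0 for r in names if r != s} for s in names}

    for round_no in range(1, max_rounds + 1):
        # Each peer may independently fail to send or receive this round.
        can_send = {n: rng.random() >= failure_rate for n in names}
        can_recv = {n: rng.random() >= failure_rate for n in names}

        for s in names:
            for r in names:
                if s == r:
                    continue
                # Acknowledgement travels from r to s and may be lost,
                # in which case s keeps its stale (lower) value.
                if can_send[r] and can_recv[s]:
                    known_ack[s][r] = len(delivered[r][s])
                idx = known_ack[s][r]
                # s resends the first message it believes is unacknowledged.
                if idx < len(outbox[s]) and can_send[s] and can_recv[r]:
                    # r accepts only the next expected message, so
                    # duplicates caused by stale acks are ignored.
                    if idx == len(delivered[r][s]):
                        delivered[r][s].append(outbox[s][idx])

        if all(delivered[r][s] == outbox[s]
               for r in names for s in names if s != r):
            return round_no
    return None


if __name__ == "__main__":
    rounds = stress_test_broadcast()
    print(f"converged after {rounds} rounds")
```

With `failure_rate=0.0` this model delivers one message per link per round, so it converges in exactly `messages_per_peer` rounds. With `failure_rate=1.0` it never converges and returns `None`, which is the "stuck" outcome a real test would want to detect and report. A real test would replace the in-memory dictionaries with the actual network components and inject failures at the send, receive, and persist boundaries.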