Stress test the network reliability

Question

Opened this issue a month ago · 0 comments

Why

We have experienced many situations in which the head cannot make progress. These problems are hard to reproduce, and in each case we have spent a lot of time coordinating attempts to resolve them.

Records of these issues are here:

#1374
#1415

One possible solution was a manual snapshot recovery as outlined here:

#1416

This is unsatisfying: we would prefer the nodes to be self-healing and not to require manual intervention.

What

We should challenge the assumptions of the reliability layer. That layer is currently a combination of ouroboros-network and an implementation of Logged Uniform Reliable Broadcast as described in https://fileadmin.cs.lth.se/cs/Personal/Amr_Ergawy/dist-algos-slides/fourth-presentation.pdf
We also want to challenge the assumption that the on-disk persistence of the vector clock and outbound messages is actually needed (see #1417).
We also want to provide a way for users with stuck heads to collect diagnostic information and submit it to the team for analysis.

How

Create a test that stresses the network layer with three or more intermittently failing peers. A failing peer is a peer that fails to send, receive, or persist network messages.
(Optional) Extract the network layer into its own package to remove coupling.
Create an issue template for submitting stuck head problems, and optionally a mechanism to allow users to provide data there.
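
To make the first item more concrete, here is a rough sketch of the shape such a stress test could take. It is a toy model in Python, not the actual network layer: the `Peer` class, the `run` function, and all parameters (`fail_prob`, `drop_prob`, `forget_acks`) are hypothetical names invented for illustration. It assumes a simple scheme in which each peer resends the oldest message a neighbour has not acknowledged and piggybacks its delivery vector on every message. `forget_acks=True` loosely models a peer that failed to persist its acknowledgement state; the outbound messages themselves are assumed to survive.

```python
import random


class Peer:
    def __init__(self, index, n_peers):
        self.index = index
        self.n_peers = n_peers
        self.outbox = []            # every message this peer has broadcast, in order
        self.seen = [0] * n_peers   # seen[j]: peer j's messages delivered so far, in order
        self.acked = [0] * n_peers  # acked[j]: how many of our messages peer j is known to have
        self.delivered = []         # (sender, seq) pairs in delivery order

    def broadcast(self, payload):
        # A peer delivers its own message immediately.
        self.delivered.append((self.index, len(self.outbox)))
        self.outbox.append(payload)
        self.seen[self.index] += 1

    def receive(self, sender, seq, their_seen):
        # Piggybacked ack: how many of our messages the sender has delivered.
        self.acked[sender] = max(self.acked[sender], their_seen[self.index])
        # Deliver only the next expected message; duplicates and gaps are ignored.
        if seq is not None and seq == self.seen[sender]:
            self.delivered.append((sender, seq))
            self.seen[sender] += 1


def run(n_peers=3, messages_per_peer=20, fail_prob=0.3, drop_prob=0.3,
        forget_acks=False, max_rounds=10_000, seed=0):
    """Return (rounds_needed, peers); rounds_needed is None if the run never converged."""
    rng = random.Random(seed)
    peers = [Peer(i, n_peers) for i in range(n_peers)]
    done = [messages_per_peer] * n_peers
    for round_no in range(1, max_rounds + 1):
        # Each peer is independently down for this round with probability fail_prob.
        up = [rng.random() >= fail_prob for _ in peers]
        for p in peers:
            if not up[p.index]:
                if forget_acks:
                    p.acked = [0] * n_peers  # assumption: ack state was not persisted
            elif len(p.outbox) < messages_per_peer:
                p.broadcast(f"msg-{p.index}-{len(p.outbox)}")
        for p in peers:
            if not up[p.index]:
                continue
            for q in peers:
                # A message is lost if the receiver is down or the link drops it.
                if q is p or not up[q.index] or rng.random() < drop_prob:
                    continue
                # Resend the oldest unacknowledged message, or a bare heartbeat (seq=None).
                nxt = p.acked[q.index]
                seq = nxt if nxt < len(p.outbox) else None
                q.receive(p.index, seq, list(p.seen))
        if all(peer.seen == done for peer in peers):
            return round_no, peers
    return None, peers


if __name__ == "__main__":
    rounds, _ = run(n_peers=3, forget_acks=True)
    print("converged" if rounds is not None else "stuck", "after", rounds, "rounds")
```

A real test would replace the in-memory `receive` call with the actual transport and inject faults at the send, receive, and persistence boundaries. The property to check stays the same: every peer eventually delivers every message from every other peer, in order, and a run that returns `None` is a candidate stuck head worth capturing for diagnosis.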