stellar/stellar-core

overlay: consider removing connection "drop modes"

marta-lokhova opened this issue · 0 comments

Currently, we implement two drop modes in our connection strategy. Whenever a connection is dropped, there are ways to deal with its existing outbound queue of messages:

  • "ignore queue": In this mode, we presumably stop sending anything outbound and immediately begin shutting down the socket.
  • "flush queue": Flush whatever we currently have in the queue, but don't add any new work to it. Once the queue is empty, begin shutting down the socket.

While this makes sense conceptually, there are a few issues with the current implementation:

  • The current usage of "ignore queue" mode does not work in certain drop scenarios as expected. For example, a common way to drop a peer is to first send an error followed by a socket shutdown (this way, peers on the other end know why they are being dropped). This is done in the sendErrorAndDrop function, which is used with the "ignore queue" mode throughout the codebase. The problem is that the "ignore queue" mode doesn't differentiate or prioritize error messages, so they end up getting dropped anyway.
  • The modes were originally added in #1558, when the shutdown was triggered immediately after a drop. Later, #2501 introduced a 5-second delay to socket shutdown, which essentially made drop modes redundant since, with the new delay, we are extremely likely to send the outstanding queue anyway.

To simplify the code and its maintenance, we could probably remove the modes completely in favor of the socket shutdown delay, as it seems that the current delay provides us with enough of a buffer to flush the required messages (we only really use the "flush queue" mode during authentication, where we exchange a very small number of messages).