Improve handling of multiple connections per peer

Question

Opened this issue a year ago · 0 comments

We can have multiple connections to the same peer. The way this is currently handled is that we multiplex those connections onto one "virtual connection", which we then treat as if it were a single connection. In detail, incoming messages are multiplexed onto the virtual connection (that is, messages from all connections are received), but outgoing messages are sent on only one connection: the first one in the list of connections. If sending fails, that connection is removed from the list and the send is retried on the next connection. This means that in practice only one connection per peer is used for sending at any given time.
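For reference, the send-side behaviour described above can be sketched roughly as follows. This is a simplified illustration, not the actual implementation: `send_with_failover` and the `try_send` callback are hypothetical names, and the real code is asynchronous and built on components like MultiSink.

```rust
/// Tries to send on the first connection in the list. On failure, the failed
/// connection is dropped from the list and the send is retried on the next one.
/// Returns `true` if some connection accepted the message.
///
/// `try_send` is a hypothetical callback standing in for the real send call.
pub fn send_with_failover<C>(
    connections: &mut Vec<C>,
    mut try_send: impl FnMut(&mut C) -> bool,
) -> bool {
    while !connections.is_empty() {
        if try_send(&mut connections[0]) {
            return true;
        }
        // The first connection failed: remove it and fall through to the next.
        connections.remove(0);
    }
    false
}
```

Note that connections further down the list are never used for sending unless every connection before them has failed.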

This design is not ideal for several reasons:

Even though only one connection is used, the remaining connections are still kept open using keep-alive packets which can contribute to battery drain (although the keep-alives are sent once a minute which might not be too bad in practice).
The active connection is picked arbitrarily, regardless of how "good" that connection is. For example, consider a peer we have two connections to: one local and one global. The current system might blindly use the global one even though the local one likely has better bandwidth.
The implementation is too complex (for example, it requires complicated components like MultiStream, MultiSink or Barrier to work) and hard to reason about and debug.

Proposed improvements:

Instead of keeping multiple connections per peer, we keep only one, but have a mechanism to replace it with another connection if that one is "better". In more detail:

When a new connection is established, we perform the handshake. If it succeeds, we obtain the peer's runtime id, and the successful handshake also proves that the connection works. We then check whether we already have a connection to that peer. If not, we set the new connection as the active connection for that peer. Otherwise we compare the existing and the new connection to determine which is "better" (more on that later). If the existing one is better, we close the new one, but wait until the existing one is closed and then try to re-establish the one we closed. If the new one is better, we close the old one and replace it with the new one.
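The replacement step above could look something like the following sketch. All names here (`ActiveConnections`, `Outcome`, `on_handshake_ok`) are hypothetical, and the "better" comparison is passed in as a closure so that the sketch does not commit to any particular rule. Re-establishing a rejected connection once the active one closes is not shown.

```rust
use std::collections::hash_map::{Entry, HashMap};
use std::hash::Hash;

/// What happened to a connection that just completed the handshake.
#[derive(Debug, PartialEq, Eq)]
pub enum Outcome<C> {
    /// There was no previous connection; the new one became active.
    Installed,
    /// The new one became active; the returned old one should be closed.
    Replaced(C),
    /// The existing one stays active; the returned new one should be closed.
    Rejected(C),
}

/// At most one active connection per peer (hypothetical type for this sketch).
pub struct ActiveConnections<Id, C> {
    active: HashMap<Id, C>,
}

impl<Id: Eq + Hash, C> ActiveConnections<Id, C> {
    pub fn new() -> Self {
        Self { active: HashMap::new() }
    }

    /// Called after a successful handshake with `peer`.
    /// `is_better(new, existing)` decides whether `new` should replace `existing`.
    pub fn on_handshake_ok(
        &mut self,
        peer: Id,
        new: C,
        is_better: impl Fn(&C, &C) -> bool,
    ) -> Outcome<C> {
        match self.active.entry(peer) {
            Entry::Vacant(slot) => {
                slot.insert(new);
                Outcome::Installed
            }
            Entry::Occupied(mut slot) => {
                if is_better(&new, slot.get()) {
                    // `insert` on an occupied entry returns the old value.
                    Outcome::Replaced(slot.insert(new))
                } else {
                    Outcome::Rejected(new)
                }
            }
        }
    }

    pub fn get(&self, peer: &Id) -> Option<&C> {
        self.active.get(peer)
    }
}
```

Returning the losing connection to the caller, rather than closing it inside the map, leaves it up to the caller to close it and to remember its address for a later reconnection attempt.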

In order for this to work consistently, only one of the peers will perform the connection selection. We need some way to pick this peer which both peers would agree on. One simple way which might be sufficient is to pick the one with the higher runtime id.
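A minimal sketch of the "higher runtime id selects" rule, assuming runtime ids are totally ordered. `RuntimeId` here is a hypothetical stand-in (a fixed-size byte array compared lexicographically); the real type may differ.

```rust
/// Hypothetical stand-in for the peer's runtime id.
/// Deriving `Ord` on a byte array gives a lexicographic total order.
#[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord)]
pub struct RuntimeId(pub [u8; 32]);

/// Returns `true` if the local peer is the one responsible for choosing
/// which connection to keep. Both peers evaluate the same comparison with
/// the arguments swapped, so exactly one of them gets `true` whenever the
/// ids differ.
pub fn is_selector(ours: &RuntimeId, theirs: &RuntimeId) -> bool {
    ours > theirs
}
```

If the two ids were equal, neither side would select, so this rule relies on the ids of two distinct peers never being equal.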

Determining which connection is better can be done in a multitude of ways. A very simple one is to consider only the connection protocol (TCP, QUIC, ...) and the IP address:

Prefer local over global
Prefer QUIC over TCP
If both are local (or both are global) and both use the same protocol, prefer the existing one
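The three rules above can be sketched as a small comparison function. This makes two assumptions that the rules themselves do not state: locality takes precedence over protocol (the order in which the rules are listed), and "local" is approximated as a loopback, private or link-local address. `Protocol`, `ConnectionInfo`, `Choice` and `choose` are hypothetical names.

```rust
use std::net::IpAddr;

/// Transport protocol of a connection (hypothetical type for this sketch).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Protocol {
    Tcp,
    Quic,
}

/// The only two properties the simple rules look at.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct ConnectionInfo {
    pub protocol: Protocol,
    pub addr: IpAddr,
}

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Choice {
    KeepExisting,
    UseNew,
}

/// Rough approximation of "local" (an assumption of this sketch).
fn is_local(addr: IpAddr) -> bool {
    match addr {
        IpAddr::V4(a) => a.is_loopback() || a.is_private() || a.is_link_local(),
        IpAddr::V6(a) => {
            let first = a.segments()[0];
            a.is_loopback()
                || (first & 0xfe00) == 0xfc00 // unique local, fc00::/7
                || (first & 0xffc0) == 0xfe80 // link-local, fe80::/10
        }
    }
}

/// Higher is better. Tuples compare left to right, so locality is checked
/// first and the protocol only breaks ties.
fn score(c: &ConnectionInfo) -> (bool, bool) {
    (is_local(c.addr), c.protocol == Protocol::Quic)
}

/// The new connection wins only if it is strictly better; on a tie the
/// existing connection is kept.
pub fn choose(existing: &ConnectionInfo, new: &ConnectionInfo) -> Choice {
    if score(new) > score(existing) {
        Choice::UseNew
    } else {
        Choice::KeepExisting
    }
}
```

With this ordering a local TCP connection beats a global QUIC one. If protocol should outweigh locality instead, swapping the two tuple fields in `score` is enough.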

A more robust way would be to measure the round-trip time (RTT) of the connections and pick the one with the lower RTT. For QUIC, the quinn crate exposes an API for that. For TCP, this information exists at the OS level (Linux, Windows) as well, but there is currently no API in Rust (tokio or std) that exposes it. It might be possible to expose it ourselves, but that would require writing low-level and very platform-dependent code. Alternatively, we can measure the RTT ourselves at the application level, or use the RTT test only for QUIC connections.
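As a rough illustration of the application-level option, the sketch below times a ping/pong exchange a few times and keeps the smallest sample. `measure_rtt` and the `ping_pong` callback are hypothetical; the callback is assumed to send a ping and block until the matching pong arrives. A real implementation would be asynchronous and would need a ping message in the protocol.

```rust
use std::time::{Duration, Instant};

/// Runs `ping_pong` `samples` times and returns the smallest observed
/// round-trip time, or `None` if `samples` is zero. Stops at the first error.
///
/// Taking the minimum filters out samples inflated by scheduling or queueing
/// delays; a smoothed average would be another reasonable choice.
pub fn measure_rtt<E>(
    samples: usize,
    mut ping_pong: impl FnMut() -> Result<(), E>,
) -> Result<Option<Duration>, E> {
    let mut best: Option<Duration> = None;
    for _ in 0..samples {
        let start = Instant::now();
        ping_pong()?;
        let rtt = start.elapsed();
        best = Some(match best {
            Some(b) => b.min(rtt),
            None => rtt,
        });
    }
    Ok(best)
}
```

The resulting durations could then be compared directly when deciding whether a new connection should replace the existing one.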

That said, the simple algorithm based on protocol and IP address would likely already be an improvement over what we have currently.