celestiaorg/celestia-core

Sending messages slows to a crawl when connecting more peers.


In the network tests where we run only two nodes, proposers are able to send block parts and messages at a very high rate (100 MB/s).

However, when we increase the number of nodes, this stops being the case: proposers are only able to send block parts at a fraction of the allocated bandwidth. The next step is to recreate the issue locally using a simple reactor and multiple local TCP connections with added latency.

Before we can move forward with closing this issue, we must be able to prove that at least one specific part of the stack is causing the delay in sending messages. Once we know this, we can reevaluate the path forward.

The thinking behind this issue stems from tracing the entire lifecycle of a block part. Below we can see a block part start its journey at the bottom with the block proposer. From there, it is gossiped to the proposer's peers, and then to their peers, and so on. Blue indicates the first time a peer received a given block part.

[screenshot: bp_trace]

Here's a zoomed-in view of the above:

[screenshot: bp_trace_zoom_1]

If we measure the transit time (the time from calling Send in the reactor of the sender to Receive in the reactor of the receiver) for each node in the path, we can see parts taking a long time to get from the proposer to the other nodes. Each horizontal bar (y axis) is the time (x axis) it takes each hop in the path of the same block part to reach each node. For smaller block sizes, we see something like the plot below.

[plot: transit time trace]

We can see that the first two hops take a while: the first, in blue, from the proposer to its peers, and then the green, from those peers to the next.
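For reference, the per-hop numbers above come from pairing send and receive trace events. Below is a minimal sketch of that computation; the event fields and names are placeholders for illustration, not the actual trace schema.

```go
package main

import (
	"fmt"
	"time"
)

// traceEvent is a simplified stand-in for the block part trace points.
// The field names here are illustrative, not the real schema.
type traceEvent struct {
	Node      string    // node that recorded the event
	Peer      string    // peer the part was sent to / received from
	PartIndex int       // block part index
	Kind      string    // "send" or "receive"
	Time      time.Time // timestamp of the event
}

func hopKey(from, to string, part int) string {
	return fmt.Sprintf("%s->%s/part=%d", from, to, part)
}

// transitTimes matches each send event with the receive event on the other
// side of the hop and returns the per-hop transit durations.
func transitTimes(events []traceEvent) map[string]time.Duration {
	sends := make(map[string]time.Time)
	for _, e := range events {
		if e.Kind == "send" {
			sends[hopKey(e.Node, e.Peer, e.PartIndex)] = e.Time
		}
	}
	out := make(map[string]time.Duration)
	for _, e := range events {
		if e.Kind == "receive" {
			key := hopKey(e.Peer, e.Node, e.PartIndex)
			if sent, ok := sends[key]; ok {
				out[key] = e.Time.Sub(sent)
			}
		}
	}
	return out
}

func main() {
	now := time.Now()
	events := []traceEvent{
		{Node: "proposer", Peer: "peer1", PartIndex: 0, Kind: "send", Time: now},
		{Node: "peer1", Peer: "proposer", PartIndex: 0, Kind: "receive", Time: now.Add(450 * time.Millisecond)},
	}
	for hop, d := range transitTimes(events) {
		fmt.Printf("%s took %v\n", hop, d)
	}
}
```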

When we look at larger blocks, we almost universally see very long times for the proposer to send each part to the first peer, with significantly shorter times after that. Notice the enormous blue Segment 1 (the first hop from the proposer to its peers), followed by the much smaller segments in the other colors.

[screenshot from 2024-05-27 09-24-27]

When we look at traces of when we are sending and receiving the block parts, overlaid with bandwidth, we see that most of the wait is on the send side. As soon as the nodes receive a block part, they process it. We can also see that we aren't using anywhere close to all of our allocated bandwidth per peer.

[plot: sending and reacting quickly, distributing slowly]

This differs from when there are only two nodes in the network, where we can utilize as much bandwidth as we have allocated (up to 100 MB/s).

[plot: network trace, 2 validators]
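To put a number on "not using our allocated bandwidth per peer", the trace rows can be reduced to an observed bytes/sec per peer and compared against the configured per-peer send rate. A rough sketch, where the record shape and the 5 MB/s figure are just placeholders:

```go
package main

import (
	"fmt"
	"time"
)

// sendRecord is an illustrative trace row: bytes written to a peer at a time.
type sendRecord struct {
	Peer  string
	Bytes int
	Time  time.Time
}

// observedRates returns the observed bytes/sec per peer over the window
// covered by that peer's records.
func observedRates(records []sendRecord) map[string]float64 {
	type window struct {
		first, last time.Time
		total       int
	}
	perPeer := make(map[string]*window)
	for _, r := range records {
		w, ok := perPeer[r.Peer]
		if !ok {
			perPeer[r.Peer] = &window{first: r.Time, last: r.Time, total: r.Bytes}
			continue
		}
		if r.Time.Before(w.first) {
			w.first = r.Time
		}
		if r.Time.After(w.last) {
			w.last = r.Time
		}
		w.total += r.Bytes
	}
	out := make(map[string]float64)
	for peer, w := range perPeer {
		secs := w.last.Sub(w.first).Seconds()
		if secs <= 0 {
			secs = 1 // single sample: avoid dividing by zero
		}
		out[peer] = float64(w.total) / secs
	}
	return out
}

func main() {
	const allocated = 5 << 20 // placeholder per-peer send rate of 5 MB/s
	now := time.Now()
	records := []sendRecord{
		{Peer: "peer1", Bytes: 64 << 10, Time: now},
		{Peer: "peer1", Bytes: 64 << 10, Time: now.Add(2 * time.Second)},
	}
	for peer, rate := range observedRates(records) {
		fmt.Printf("%s: %.0f B/s observed (%.1f%% of allocated)\n",
			peer, rate, 100*rate/float64(allocated))
	}
}
```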

We can replicate the bug here:

  • Very much a WIP branch, but we can replicate the issue using this test after adding latency locally with tc (a minimal standalone harness is sketched below)
  • Added more traces to figure out where the latency is coming from
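For anyone who wants to poke at this without the branch, here is a minimal local harness in the same spirit. This is not the WIP branch's test, just a standalone sketch: it opens several loopback TCP connections, pushes block-part-sized chunks through each, and reports aggregate throughput. Latency is added externally with tc as mentioned above (e.g. something like `tc qdisc add dev lo root netem delay 50ms`); the peer count, part size, and duration are arbitrary.

```go
package main

import (
	"fmt"
	"io"
	"net"
	"sync"
	"sync/atomic"
	"time"
)

func main() {
	const (
		numPeers = 8
		partSize = 64 << 10 // roughly block-part-sized chunks
		duration = 5 * time.Second
	)

	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		panic(err)
	}
	defer ln.Close()

	// "Receivers": accept connections and drain them as fast as possible.
	go func() {
		for {
			conn, err := ln.Accept()
			if err != nil {
				return
			}
			go io.Copy(io.Discard, conn)
		}
	}()

	var total atomic.Int64
	var wg sync.WaitGroup
	deadline := time.Now().Add(duration)

	// "Proposer": write chunks to every peer connection until the deadline.
	for i := 0; i < numPeers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			conn, err := net.Dial("tcp", ln.Addr().String())
			if err != nil {
				panic(err)
			}
			defer conn.Close()
			part := make([]byte, partSize)
			for time.Now().Before(deadline) {
				n, err := conn.Write(part)
				if err != nil {
					return
				}
				total.Add(int64(n))
			}
		}()
	}
	wg.Wait()

	mb := float64(total.Load()) / (1 << 20)
	fmt.Printf("sent %.1f MB over %d connections in %v (%.1f MB/s)\n",
		mb, numPeers, duration, mb/duration.Seconds())
}
```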

I'm in the middle of digging deeper into the above to figure out exactly where the latency is coming from. It's possible that we have a bit of a "buffer hernia": we are adding to the buffers at the configured send and receive rate, but those buffers are not actually being emptied by TCP. If this is occurring, and we don't have a mechanism to stop adding to the buffer when the TCP buffer is full, then it could explain why we see very late block part messages and huge delays for block parts when proposing new blocks.
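To make the "buffer hernia" idea concrete, here is a toy illustration (not the p2p code): an unbounded user-space queue in front of the connection keeps accepting parts at the configured rate even when TCP can no longer drain them, while a bounded queue pushes back on the producer as soon as writes stall.

```go
package main

import (
	"fmt"
	"net"
	"time"
)

// unboundedSender never pushes back: enqueue always succeeds, so when TCP
// stops draining, the queue (and message latency) just grows. This is the
// suspected "buffer hernia" shape.
type unboundedSender struct {
	queue [][]byte
	conn  net.Conn
}

func (s *unboundedSender) enqueue(part []byte) {
	s.queue = append(s.queue, part) // accepted regardless of TCP state
}

// boundedSender blocks the producer once maxQueued parts are in flight,
// tying the production rate to how fast the connection actually drains.
type boundedSender struct {
	queue chan []byte
	conn  net.Conn
}

func newBoundedSender(conn net.Conn, maxQueued int) *boundedSender {
	s := &boundedSender{queue: make(chan []byte, maxQueued), conn: conn}
	go s.writeLoop()
	return s
}

func (s *boundedSender) enqueue(part []byte) {
	s.queue <- part // blocks when the queue is full: backpressure
}

func (s *boundedSender) writeLoop() {
	for part := range s.queue {
		// Write blocks once the send buffer is full, i.e. when the
		// receiver isn't draining fast enough.
		if _, err := s.conn.Write(part); err != nil {
			return
		}
	}
}

func main() {
	client, server := net.Pipe() // in-memory conn: writes block until read

	// Slow "receiver": consume one part every 100ms.
	go func() {
		buf := make([]byte, 1024)
		for {
			if _, err := server.Read(buf); err != nil {
				return
			}
			time.Sleep(100 * time.Millisecond)
		}
	}()

	s := newBoundedSender(client, 4)
	start := time.Now()
	for i := 0; i < 10; i++ {
		s.enqueue(make([]byte, 1024)) // stalls once 4 parts are queued
	}
	fmt.Printf("producer throttled to the receiver's pace after %v\n", time.Since(start))
}
```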

After debugging more today, it looks like we are unable to clear the buffer on the receiver side, which is forcing the send side to send data slowly. It could just be that the buffer on the receiver side is too small.
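One quick experiment for that hypothesis (a sketch, not the actual connection code): explicitly enlarge the socket receive buffer and keep the read loop doing nothing but draining the connection. Note that the kernel may cap SetReadBuffer (e.g. at net.core.rmem_max on Linux), and the 4 MiB figure is arbitrary.

```go
package main

import (
	"fmt"
	"io"
	"net"
)

func handle(conn net.Conn) {
	if tcp, ok := conn.(*net.TCPConn); ok {
		// Ask the kernel for a larger receive buffer so the advertised TCP
		// window isn't what throttles the sender.
		if err := tcp.SetReadBuffer(4 << 20); err != nil { // 4 MiB, arbitrary
			fmt.Println("SetReadBuffer:", err)
		}
	}
	// Drain as fast as possible; hand actual processing off to another
	// goroutine so reads never stall behind it.
	buf := make([]byte, 64<<10)
	for {
		n, err := conn.Read(buf)
		if n > 0 {
			_ = buf[:n] // hand off to a worker here instead of processing inline
		}
		if err != nil {
			if err != io.EOF {
				fmt.Println("read:", err)
			}
			return
		}
	}
}

func main() {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		panic(err)
	}
	fmt.Println("listening on", ln.Addr())
	for {
		conn, err := ln.Accept()
		if err != nil {
			return
		}
		go handle(conn)
	}
}
```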

Will run more experiments tomorrow!