ethereum/trinity

ConnectionBusy exception thrown from BeamDownloader

cburgdorf opened this issue · 7 comments

What is wrong?

I've noticed the following exception.

f2b7-40b4-a77a-6ba6d13da819>: Timed out waiting for NodeDataV65 request lock or connection: Connection-<Session <Node(0xd4062d@91.13.194.238)> 1c231c61-f2b7-40b4-a77a-6ba6d13da819>
   DEBUG  2020-09-02 12:59:58,686        BeamDownloader  Problem downloading nodes from peer, dropping...
Traceback (most recent call last):
  File "/usr/src/app/trinity/p2p/exchange/candidate_stream.py", line 81, in payload_candidates
    timeout=total_timeout * NUM_QUEUED_REQUESTS,
  File "/usr/local/lib/python3.7/asyncio/tasks.py", line 449, in wait_for
    raise futures.TimeoutError()
concurrent.futures._base.TimeoutError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/src/app/trinity/trinity/sync/beam/state.py", line 681, in _request_nodes
    node_hashes, timeout=self._reply_timeout)
  File "/usr/src/app/trinity/trinity/protocol/eth/exchanges.py", line 110, in __call__
    timeout,
  File "/usr/src/app/trinity/p2p/exchange/exchange.py", line 73, in get_result
    timeout,
  File "/usr/src/app/trinity/p2p/exchange/manager.py", line 59, in get_result
    async for payload in stream.payload_candidates(request, tracker, timeout=timeout):
  File "/usr/src/app/trinity/p2p/exchange/candidate_stream.py", line 85, in payload_candidates
    f"Timed out waiting for {self.response_cmd_name} request lock "
p2p.exceptions.ConnectionBusy: Timed out waiting for NodeDataV65 request lock or connection: Connection-<Session <Node(0xd4062d@91.13.194.238)> 1c231c61-f2b7-40b4-a77a-6ba6d13da819>

It seems that after that error the client was left in a broken state and the logs were dominated by this warning.

WARNING  2020-09-02 13:09:50,331        BeamDownloader  Unexpected p2p err while downloading nodes from ETHPeer (eth, 65) <Session <Node(0xd4062d@91.13.194.238)> 1c231c61-f2b7-40b4-a77a-6ba6d13da819>: Timed out waiting for NodeDataV65 request lock or connection: Connection-<Session <Node(0xd4062d@91.13.194.238)> 1c231c61-f2b7-40b4-a77a-6ba6d13da819>

Full debug log:

trinity_debug_log.zip

How can it be fixed?

The error suggests that we have multiple code paths making simultaneous NodeDataV65 requests to the same peer. @carver I guess this is happening because state backfill and regular state fetching are going against the same peer. We had a similar discussion when we hit the same error for GetBlockHeader after introducing the header backfill. Back then we ended up bumping NUM_QUEUED_REQUESTS to allow more concurrent requests, but I'm not sure that is the solution here 🤔
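To make the contention concrete, here is a minimal asyncio sketch of the failure mode. The constants, class, and function names are made up for illustration and are not the real candidate_stream.py code: several callers queue up on a single per-peer request lock, the lock wait is capped at total_timeout * NUM_QUEUED_REQUESTS, and once the peer stops answering, the callers at the back of the queue give up with a ConnectionBusy-style error.

```python
import asyncio

# Made-up stand-ins for the values referenced in the traceback above; the
# real ones live in p2p/exchange/candidate_stream.py and p2p/exceptions.py.
NUM_QUEUED_REQUESTS = 4
TOTAL_TIMEOUT = 0.5  # assumed per-request timeout, for illustration only


class ConnectionBusy(Exception):
    """Stand-in for p2p.exceptions.ConnectionBusy."""


async def request_node_data(lock: asyncio.Lock, peer_replied: asyncio.Event) -> None:
    # Wait for the per-peer request lock, but only for as long as
    # NUM_QUEUED_REQUESTS requests could reasonably take.
    try:
        await asyncio.wait_for(lock.acquire(), timeout=TOTAL_TIMEOUT * NUM_QUEUED_REQUESTS)
    except asyncio.TimeoutError:
        raise ConnectionBusy("Timed out waiting for NodeDataV65 request lock or connection")
    try:
        # Wait for the peer's reply; a peer that has gone silent never sets the event.
        await asyncio.wait_for(peer_replied.wait(), timeout=TOTAL_TIMEOUT)
    except asyncio.TimeoutError:
        pass  # the real code turns this into a retry against another peer
    finally:
        lock.release()


async def main() -> None:
    lock = asyncio.Lock()
    peer_replied = asyncio.Event()  # never set: the peer stopped responding
    # Two code paths (say, beam sync and state backfill) each firing several
    # requests at the same peer; the callers at the back of the queue time out
    # on the lock itself and surface ConnectionBusy.
    results = await asyncio.gather(
        *(request_node_data(lock, peer_replied) for _ in range(2 * NUM_QUEUED_REQUESTS)),
        return_exceptions=True,
    )
    busy = sum(isinstance(r, ConnectionBusy) for r in results)
    print(f"{busy} of {len(results)} requests gave up with ConnectionBusy")


if __name__ == "__main__":
    asyncio.run(main())
```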

Yeah, there shouldn't be this many simultaneous requests; it's a sign of some other bug. So 👎 on increasing NUM_QUEUED_REQUESTS. No guesses yet on why you're seeing it, though. To trigger it, you usually need a peer that's good for a while and then disappears and starts timing out on all requests. So any test to try to recreate the situation would probably have to follow that script.
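If it helps, that "script" could look roughly like this in a test: a fake peer that answers normally for a while and then goes silent, so every later request times out. All names here are hypothetical and this doesn't use any real Trinity test fixtures.

```python
import asyncio


class SilentAfterNPeer:
    """Hypothetical fake peer: answers the first `good_responses` requests,
    then never replies again (as if it dropped off the network)."""

    def __init__(self, good_responses: int, reply_delay: float = 0.01) -> None:
        self._remaining = good_responses
        self._reply_delay = reply_delay

    async def request_node_data(self, node_hashes):
        if self._remaining > 0:
            self._remaining -= 1
            await asyncio.sleep(self._reply_delay)
            return [b"" for _ in node_hashes]  # dummy trie nodes
        # Peer "disappeared": hang until the caller's timeout cancels us.
        await asyncio.Event().wait()


async def test_peer_goes_silent() -> None:
    peer = SilentAfterNPeer(good_responses=5)
    timeouts = 0
    for _ in range(10):
        try:
            await asyncio.wait_for(peer.request_node_data([b"\x00" * 32]), timeout=0.1)
        except asyncio.TimeoutError:
            timeouts += 1
    # The first 5 requests succeed, the last 5 time out -- the good-then-gone
    # peer behaviour described above.
    assert timeouts == 5
    print("reproduced: peer went silent after 5 good responses")


if __name__ == "__main__":
    asyncio.run(test_peer_goes_silent())
```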

There's a good chance that this error doesn't exist anymore. When I filed the issue I assumed I was testing the latest master, but due to a bug in my DappNode deployment workflow I ended up on old code. Now my node runs on the latest code and the bug has not happened within the past 18 hours. Closing for now.

Ah no, this is in fact still happening with the latest source.


WARNING  2020-09-10 01:38:37,849        BeamDownloader  Unexpected p2p err while downloading nodes from ETHPeer (eth, 65) <Session <Node(0xcbceb7@172.33.0.2)> bd53557e-3322-4fe3-a664-d458de052586>: Timed out waiting for NodeDataV65 request lock or connection: Connection-<Session <Node(0xcbceb7@172.33.0.2)> bd53557e-3322-4fe3-a664-d458de052586>
   DEBUG  2020-09-10 01:38:37,849        BeamDownloader  Problem downloading nodes from peer, dropping...
Traceback (most recent call last):
  File "/usr/src/app/trinity/p2p/exchange/candidate_stream.py", line 81, in payload_candidates
    timeout=total_timeout * NUM_QUEUED_REQUESTS,
  File "/usr/local/lib/python3.7/asyncio/tasks.py", line 449, in wait_for
    raise futures.TimeoutError()
concurrent.futures._base.TimeoutError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/src/app/trinity/trinity/sync/beam/state.py", line 780, in _request_nodes
    node_hashes, timeout=self._reply_timeout)
  File "/usr/src/app/trinity/trinity/protocol/eth/exchanges.py", line 110, in __call__
    timeout,
  File "/usr/src/app/trinity/p2p/exchange/exchange.py", line 73, in get_result
    timeout,
  File "/usr/src/app/trinity/p2p/exchange/manager.py", line 59, in get_result
    async for payload in stream.payload_candidates(request, tracker, timeout=timeout):
  File "/usr/src/app/trinity/p2p/exchange/candidate_stream.py", line 85, in payload_candidates
    f"Timed out waiting for {self.response_cmd_name} request lock "
p2p.exceptions.ConnectionBusy: Timed out waiting for NodeDataV65 request lock or connection: Connection-<Session <Node(0xcbceb7@172.33.0.2)> bd53557e-3322-4fe3-a664-d458de052586>
   DEBUG  2020-09-10 01:38:37,877         QueeningQueue  Penalizing ETHPeer (eth, 65) <Session <Node(0xcbceb7@172.33.0.2)> bd53557e-3322-4fe3-a664-d458de052586> for 2.00s, for minor infraction
WARNING  2020-09-10 01:38:37,904        BeamDownloader  Unexpected p2p err while downloading nodes from ETHPeer (eth, 63) <Session <Node(0xd05d87@3.214.216.237)> 4a651c05-7993-445d-a9dc-0ae741ec7b5b>: Timed out waiting for NodeDataV65 request lock or connection: Connection-<Session <Node(0xd05d87@3.214.216.237)> 4a651c05-7993-445d-a9dc-0ae741ec7b5b>
   DEBUG  2020-09-10 01:38:37,905        BeamDownloader  Problem downloading nodes from peer, dropping...
Traceback (most recent call last):
  File "/usr/src/app/trinity/p2p/exchange/candidate_stream.py", line 81, in payload_candidates
    timeout=total_timeout * NUM_QUEUED_REQUESTS,
  File "/usr/local/lib/python3.7/asyncio/tasks.py", line 449, in wait_for
    raise futures.TimeoutError()
concurrent.futures._base.TimeoutError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/src/app/trinity/trinity/sync/beam/state.py", line 780, in _request_nodes
    node_hashes, timeout=self._reply_timeout)
  File "/usr/src/app/trinity/trinity/protocol/eth/exchanges.py", line 110, in __call__
    timeout,
  File "/usr/src/app/trinity/p2p/exchange/exchange.py", line 73, in get_result
    timeout,
  File "/usr/src/app/trinity/p2p/exchange/manager.py", line 59, in get_result
    async for payload in stream.payload_candidates(request, tracker, timeout=timeout):
  File "/usr/src/app/trinity/p2p/exchange/candidate_stream.py", line 85, in payload_candidates
    f"Timed out waiting for {self.response_cmd_name} request lock "
p2p.exceptions.ConnectionBusy: Timed out waiting for NodeDataV65 request lock or connection: Connection-<Session <Node(0xd05d87@3.214.216.237)> 4a651c05-7993-445d-a9dc-0ae741ec7b5b>

Hm, any chance that 0xd05d87 disconnected and reconnected once or a couple of times before this happened? I'm trying to think about how we could end up with so many live queries for GetNodeData.

Since you seem to be getting this more often than me, can you try out #2039?

Edit: well, maybe not yet. I was a bit overconfident that it would work flawlessly on the first shot 😆
Edit 2: okay, looks resolved now...

Another thing that might be worth considering is just throwing a p2p timeout error instead. I actually think the current exception is showing us important information, but if we can't tackle it in the next week, then I think it's okay to stop emitting it. I think a timeout error is fine in all cases where a ConnectionBusy would be raised.
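For reference, the change being suggested would roughly look like this at the point where the lock wait expires. This is purely illustrative, with made-up names, not the actual candidate_stream.py code:

```python
import asyncio

# Illustrative stand-in for the spot where the request-lock wait currently
# escalates to ConnectionBusy.


class ConnectionBusy(Exception):
    """Stand-in for p2p.exceptions.ConnectionBusy."""


async def acquire_request_lock(lock: asyncio.Lock, timeout: float) -> None:
    try:
        await asyncio.wait_for(lock.acquire(), timeout=timeout)
    except asyncio.TimeoutError:
        # Current behaviour (roughly): escalate to ConnectionBusy, which the
        # BeamDownloader logs as an "Unexpected p2p err" and penalizes the peer for.
        # raise ConnectionBusy("Timed out waiting for request lock or connection")
        #
        # Suggested alternative: let a plain timeout propagate so callers handle
        # it the same way as any other slow or missing response.
        raise asyncio.TimeoutError("Timed out waiting for request lock") from None
```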

Seems to have been resolved according to #2039 (review)

I can confirm that this fixes the ConnectionBusy issue 👍