ethereum/trinity

Peer Pool crashes, dies but does not crash the whole node

cburgdorf opened this issue · 4 comments

What is wrong?

It appears that my peer pool crashed around 4:19 and from there did not recover again (no further logs from ETHPeerPool after the crash).

�[1m�[33m WARNING  2020-09-04 04:19:18,179       Sync / PeerPool  Event loop blocked or overloaded: delay=2.005s, tasks=74�[0m
�[1m�[31m   ERROR  2020-09-04 04:19:18,181               asyncio  Exception in callback <TaskWakeupMethWrapper object at 0x7fa47b456e90>
handle: <Handle <TaskWakeupMethWrapper object at 0x7fa47b456e90>>
Traceback (most recent call last):
  File "uvloop/cbhandles.pyx", line 70, in uvloop.loop.Handle._run
  File "/usr/local/lib/python3.7/site-packages/async_service/base.py", line 300, in _run_and_manage_task
    await task.run()
  File "/usr/local/lib/python3.7/site-packages/async_service/base.py", line 165, in run
    await self.child_manager.run()
  File "/usr/local/lib/python3.7/site-packages/async_service/asyncio.py", line 246, in run
    for _, exc_value, exc_tb in self._errors
trio.MultiError: TimeoutError('Tasks never returned after being cancelled: %s', {<Task cancelled coro=<wait_first() done, defined at /usr/src/app/trinity/p2p/asyncio_utils.py:31>>, <Task cancelled coro=<wait_first() done, defined at /usr/src/app/trinity/p2p/asyncio_utils.py:31>>, <Task cancelled coro=<wait_first() done, defined at /usr/src/app/trinity/p2p/asyncio_utils.py:31>>, <Task cancelled coro=<wait_first() done, defined at /usr/src/app/trinity/p2p/asyncio_utils.py:31>>}), TimeoutError('Tasks never returned after being cancelled: %s', {<Task cancelled coro=<wait_first() done, defined at /usr/src/app/trinity/p2p/asyncio_utils.py:31>>, <Task cancelled coro=<wait_first() done, defined at /usr/src/app/trinity/p2p/asyncio_utils.py:31>>, <Task cancelled coro=<wait_first() done, defined at /usr/src/app/trinity/p2p/asyncio_utils.py:31>>, <Task cancelled coro=<wait_first() done, defined at /usr/src/app/trinity/p2p/asyncio_utils.py:31>>})

Details of embedded exception 1:

  Traceback (most recent call last):
    File "/usr/src/app/trinity/p2p/asyncio_utils.py", line 82, in cancel_pending_tasks
      yield
    File "/usr/src/app/trinity/p2p/asyncio_utils.py", line 54, in wait_first
      done, pending = await asyncio.wait(tasks, return_when=asyncio.FIRST_COMPLETED)
    File "/usr/local/lib/python3.7/asyncio/tasks.py", line 389, in wait
      return await _wait(fs, timeout, return_when, loop)
    File "/usr/local/lib/python3.7/asyncio/tasks.py", line 482, in _wait
      await waiter
  concurrent.futures._base.CancelledError

  During handling of the above exception, another exception occurred:

  Traceback (most recent call last):
    File "/usr/local/lib/python3.7/site-packages/async_service/base.py", line 300, in _run_and_manage_task
      await task.run()
    File "/usr/local/lib/python3.7/site-packages/async_service/base.py", line 165, in run
      await self.child_manager.run()
    File "/usr/local/lib/python3.7/site-packages/async_service/asyncio.py", line 246, in run
      for _, exc_value, exc_tb in self._errors
    File "/usr/local/lib/python3.7/site-packages/async_service/base.py", line 300, in _run_and_manage_task
      await task.run()
    File "/usr/local/lib/python3.7/site-packages/async_service/asyncio.py", line 35, in run
      await self._async_fn(*self._async_fn_args)
    File "/usr/src/app/trinity/p2p/connection.py", line 133, in run_behaviors
      await wait_first(futures, max_wait_after_cancellation=2)
    File "/usr/src/app/trinity/p2p/asyncio_utils.py", line 69, in wait_first
      raise done_task.exception()
    File "/usr/local/lib/python3.7/contextlib.py", line 188, in __aexit__
      await self.gen.athrow(typ, value, traceback)
    File "/usr/src/app/trinity/p2p/asyncio_utils.py", line 113, in cancel_pending_tasks
      raise MultiError(errors)
  concurrent.futures._base.TimeoutError: ('Tasks never returned after being cancelled: %s', {<Task cancelled coro=<wait_first() done, defined at /usr/src/app/trinity/p2p/asyncio_utils.py:31>>, <Task cancelled coro=<wait_first() done, defined at /usr/src/app/trinity/p2p/asyncio_utils.py:31>>, <Task cancelled coro=<wait_first() done, defined at /usr/src/app/trinity/p2p/asyncio_utils.py:31>>, <Task cancelled coro=<wait_first() done, defined at /usr/src/app/trinity/p2p/asyncio_utils.py:31>>})

Details of embedded exception 2:

  Traceback (most recent call last):
    File "/usr/src/app/trinity/p2p/asyncio_utils.py", line 82, in cancel_pending_tasks
      yield
    File "/usr/src/app/trinity/p2p/asyncio_utils.py", line 54, in wait_first
      done, pending = await asyncio.wait(tasks, return_when=asyncio.FIRST_COMPLETED)
    File "/usr/local/lib/python3.7/asyncio/tasks.py", line 389, in wait
      return await _wait(fs, timeout, return_when, loop)
    File "/usr/local/lib/python3.7/asyncio/tasks.py", line 482, in _wait
      await waiter
  concurrent.futures._base.CancelledError

  During handling of the above exception, another exception occurred:

  Traceback (most recent call last):
    File "/usr/local/lib/python3.7/site-packages/async_service/base.py", line 300, in _run_and_manage_task
      await task.run()
    File "/usr/local/lib/python3.7/site-packages/async_service/base.py", line 165, in run
      await self.child_manager.run()
    File "/usr/local/lib/python3.7/site-packages/async_service/asyncio.py", line 246, in run
      for _, exc_value, exc_tb in self._errors
    File "/usr/local/lib/python3.7/site-packages/async_service/base.py", line 300, in _run_and_manage_task
      await task.run()
    File "/usr/local/lib/python3.7/site-packages/async_service/asyncio.py", line 35, in run
      await self._async_fn(*self._async_fn_args)
    File "/usr/src/app/trinity/p2p/connection.py", line 133, in run_behaviors
      await wait_first(futures, max_wait_after_cancellation=2)
    File "/usr/src/app/trinity/p2p/asyncio_utils.py", line 69, in wait_first
      raise done_task.exception()
    File "/usr/local/lib/python3.7/contextlib.py", line 188, in __aexit__
      await self.gen.athrow(typ, value, traceback)
    File "/usr/src/app/trinity/p2p/asyncio_utils.py", line 113, in cancel_pending_tasks
      raise MultiError(errors)
  concurrent.futures._base.TimeoutError: ('Tasks never returned after being cancelled: %s', {<Task cancelled coro=<wait_first() done, defined at /usr/src/app/trinity/p2p/asyncio_utils.py:31>>, <Task cancelled coro=<wait_first() done, defined at /usr/src/app/trinity/p2p/asyncio_utils.py:31>>, <Task cancelled coro=<wait_first() done, defined at /usr/src/app/trinity/p2p/asyncio_utils.py:31>>, <Task cancelled coro=<wait_first() done, defined at /usr/src/app/trinity/p2p/asyncio_utils.py:31>>})�[0m
   

Interestingly the node kept running which is a bug in itself. It should crash the whole node if that happens.

Not sure if it's related but the metrics show that 6 hours before the disk activity falls off a cliff.

(Be aware that the metrics report UTC+2 whereas the logs report UTC (#2017))

image

trinity_logs.zip

How can it be fixed

Needs further investigation

@gsalgado I haven't investigated this but I thought I draw your awareness since you've recently worked on this. The logs (attached in the issue) clearly indicate that the peer pool crashed but the node kept running.

One thing that looks suspicious to me is that we do not seem to catch any timeout exception that may come out of this call (based on the max_wait_after_cancellation).

trinity/p2p/connection.py

Lines 133 to 138 in 7c7af0e

await wait_first(futures, max_wait_after_cancellation=2)
except PeerConnectionLost:
# Any of our behaviors may propagate a PeerConnectionLost, which is to be
# expected as many Connection APIs used by them can raise that. To avoid a
# DaemonTaskExit since we're returning silently, ensure we're cancelled.
self.manager.cancel()

We should probably catch TimeoutError and log it as a warning/error wherever we call wait_first(). These could be caused by a bug when terminating any long-running-task/service but may also happen when the event loop is overloaded, so we probably don't want to let them crash a component.

Ok, so in this case trinity continued running because the component (Sync / PeerPool) itself didn't terminate. And the component didn't terminate because the PeerPool crash didn't cause the FullServer to terminate as it should (since the pool runs as a daemon service of the server):

self.manager.run_daemon_child_service(self.peer_pool)

So, I'll change this issue to be about fixing all wait_first() callsites to catch TimeoutError, as those can cause trinity to crash, and thus should be fixed ASAP. And I'll open a separate issue to investigate why FullServer didn't terminate when the PeerPool crashed