Peer Pool crashes, dies but does not crash the whole node
cburgdorf opened this issue · 4 comments
What is wrong?
It appears that my peer pool crashed around 4:19 and from there did not recover again (no further logs from ETHPeerPool
after the crash).
�[1m�[33m WARNING 2020-09-04 04:19:18,179 Sync / PeerPool Event loop blocked or overloaded: delay=2.005s, tasks=74�[0m
�[1m�[31m ERROR 2020-09-04 04:19:18,181 asyncio Exception in callback <TaskWakeupMethWrapper object at 0x7fa47b456e90>
handle: <Handle <TaskWakeupMethWrapper object at 0x7fa47b456e90>>
Traceback (most recent call last):
File "uvloop/cbhandles.pyx", line 70, in uvloop.loop.Handle._run
File "/usr/local/lib/python3.7/site-packages/async_service/base.py", line 300, in _run_and_manage_task
await task.run()
File "/usr/local/lib/python3.7/site-packages/async_service/base.py", line 165, in run
await self.child_manager.run()
File "/usr/local/lib/python3.7/site-packages/async_service/asyncio.py", line 246, in run
for _, exc_value, exc_tb in self._errors
trio.MultiError: TimeoutError('Tasks never returned after being cancelled: %s', {<Task cancelled coro=<wait_first() done, defined at /usr/src/app/trinity/p2p/asyncio_utils.py:31>>, <Task cancelled coro=<wait_first() done, defined at /usr/src/app/trinity/p2p/asyncio_utils.py:31>>, <Task cancelled coro=<wait_first() done, defined at /usr/src/app/trinity/p2p/asyncio_utils.py:31>>, <Task cancelled coro=<wait_first() done, defined at /usr/src/app/trinity/p2p/asyncio_utils.py:31>>}), TimeoutError('Tasks never returned after being cancelled: %s', {<Task cancelled coro=<wait_first() done, defined at /usr/src/app/trinity/p2p/asyncio_utils.py:31>>, <Task cancelled coro=<wait_first() done, defined at /usr/src/app/trinity/p2p/asyncio_utils.py:31>>, <Task cancelled coro=<wait_first() done, defined at /usr/src/app/trinity/p2p/asyncio_utils.py:31>>, <Task cancelled coro=<wait_first() done, defined at /usr/src/app/trinity/p2p/asyncio_utils.py:31>>})
Details of embedded exception 1:
Traceback (most recent call last):
File "/usr/src/app/trinity/p2p/asyncio_utils.py", line 82, in cancel_pending_tasks
yield
File "/usr/src/app/trinity/p2p/asyncio_utils.py", line 54, in wait_first
done, pending = await asyncio.wait(tasks, return_when=asyncio.FIRST_COMPLETED)
File "/usr/local/lib/python3.7/asyncio/tasks.py", line 389, in wait
return await _wait(fs, timeout, return_when, loop)
File "/usr/local/lib/python3.7/asyncio/tasks.py", line 482, in _wait
await waiter
concurrent.futures._base.CancelledError
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.7/site-packages/async_service/base.py", line 300, in _run_and_manage_task
await task.run()
File "/usr/local/lib/python3.7/site-packages/async_service/base.py", line 165, in run
await self.child_manager.run()
File "/usr/local/lib/python3.7/site-packages/async_service/asyncio.py", line 246, in run
for _, exc_value, exc_tb in self._errors
File "/usr/local/lib/python3.7/site-packages/async_service/base.py", line 300, in _run_and_manage_task
await task.run()
File "/usr/local/lib/python3.7/site-packages/async_service/asyncio.py", line 35, in run
await self._async_fn(*self._async_fn_args)
File "/usr/src/app/trinity/p2p/connection.py", line 133, in run_behaviors
await wait_first(futures, max_wait_after_cancellation=2)
File "/usr/src/app/trinity/p2p/asyncio_utils.py", line 69, in wait_first
raise done_task.exception()
File "/usr/local/lib/python3.7/contextlib.py", line 188, in __aexit__
await self.gen.athrow(typ, value, traceback)
File "/usr/src/app/trinity/p2p/asyncio_utils.py", line 113, in cancel_pending_tasks
raise MultiError(errors)
concurrent.futures._base.TimeoutError: ('Tasks never returned after being cancelled: %s', {<Task cancelled coro=<wait_first() done, defined at /usr/src/app/trinity/p2p/asyncio_utils.py:31>>, <Task cancelled coro=<wait_first() done, defined at /usr/src/app/trinity/p2p/asyncio_utils.py:31>>, <Task cancelled coro=<wait_first() done, defined at /usr/src/app/trinity/p2p/asyncio_utils.py:31>>, <Task cancelled coro=<wait_first() done, defined at /usr/src/app/trinity/p2p/asyncio_utils.py:31>>})
Details of embedded exception 2:
Traceback (most recent call last):
File "/usr/src/app/trinity/p2p/asyncio_utils.py", line 82, in cancel_pending_tasks
yield
File "/usr/src/app/trinity/p2p/asyncio_utils.py", line 54, in wait_first
done, pending = await asyncio.wait(tasks, return_when=asyncio.FIRST_COMPLETED)
File "/usr/local/lib/python3.7/asyncio/tasks.py", line 389, in wait
return await _wait(fs, timeout, return_when, loop)
File "/usr/local/lib/python3.7/asyncio/tasks.py", line 482, in _wait
await waiter
concurrent.futures._base.CancelledError
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.7/site-packages/async_service/base.py", line 300, in _run_and_manage_task
await task.run()
File "/usr/local/lib/python3.7/site-packages/async_service/base.py", line 165, in run
await self.child_manager.run()
File "/usr/local/lib/python3.7/site-packages/async_service/asyncio.py", line 246, in run
for _, exc_value, exc_tb in self._errors
File "/usr/local/lib/python3.7/site-packages/async_service/base.py", line 300, in _run_and_manage_task
await task.run()
File "/usr/local/lib/python3.7/site-packages/async_service/asyncio.py", line 35, in run
await self._async_fn(*self._async_fn_args)
File "/usr/src/app/trinity/p2p/connection.py", line 133, in run_behaviors
await wait_first(futures, max_wait_after_cancellation=2)
File "/usr/src/app/trinity/p2p/asyncio_utils.py", line 69, in wait_first
raise done_task.exception()
File "/usr/local/lib/python3.7/contextlib.py", line 188, in __aexit__
await self.gen.athrow(typ, value, traceback)
File "/usr/src/app/trinity/p2p/asyncio_utils.py", line 113, in cancel_pending_tasks
raise MultiError(errors)
concurrent.futures._base.TimeoutError: ('Tasks never returned after being cancelled: %s', {<Task cancelled coro=<wait_first() done, defined at /usr/src/app/trinity/p2p/asyncio_utils.py:31>>, <Task cancelled coro=<wait_first() done, defined at /usr/src/app/trinity/p2p/asyncio_utils.py:31>>, <Task cancelled coro=<wait_first() done, defined at /usr/src/app/trinity/p2p/asyncio_utils.py:31>>, <Task cancelled coro=<wait_first() done, defined at /usr/src/app/trinity/p2p/asyncio_utils.py:31>>})�[0m
Interestingly the node kept running which is a bug in itself. It should crash the whole node if that happens.
Not sure if it's related but the metrics show that 6 hours before the disk activity falls off a cliff.
(Be aware that the metrics report UTC+2 whereas the logs report UTC (#2017))
How can it be fixed
Needs further investigation
@gsalgado I haven't investigated this but I thought I draw your awareness since you've recently worked on this. The logs (attached in the issue) clearly indicate that the peer pool crashed but the node kept running.
One thing that looks suspicious to me is that we do not seem to catch any timeout exception that may come out of this call (based on the max_wait_after_cancellation
).
Lines 133 to 138 in 7c7af0e
We should probably catch TimeoutError
and log it as a warning/error wherever we call wait_first()
. These could be caused by a bug when terminating any long-running-task/service but may also happen when the event loop is overloaded, so we probably don't want to let them crash a component.
Ok, so in this case trinity continued running because the component (Sync / PeerPool
) itself didn't terminate. And the component didn't terminate because the PeerPool
crash didn't cause the FullServer
to terminate as it should (since the pool runs as a daemon service of the server):
Line 121 in 7c7af0e
So, I'll change this issue to be about fixing all wait_first()
callsites to catch TimeoutError
, as those can cause trinity to crash, and thus should be fixed ASAP. And I'll open a separate issue to investigate why FullServer
didn't terminate when the PeerPool
crashed