ethereum/trinity

Hard crash: Tasks never returned after being cancelled

carver opened this issue · 3 comments

  • Trinity version: ad80ae5 (master)
  • OS: linux
  • Python version (output of python --version): 3.8.5

What is wrong?

<Manager[Connection-<Session <Node(0xe948ba@34.77.191.47)> baadf3ca-aad7-4ef3-9fa6-1dfc905061b5>] flags=SRCfe>: task run_behaviors[daemon=True] exited with error: ('Tasks never returned after being cancelled: %s', {<Task cancelled name='A
pplication/p2p/apply/<Node(0xe948ba@34.77.191.47)>' coro=<wait_first() done, defined at /home/jcarver/code/trinity/p2p/asyncio_utils.py:31>>, <Task cancelled name='Application/eth1-chain-info/apply/<Node(0xe948ba@34.77.191.47)>' coro=<wai
t_first() done, defined at /home/jcarver/code/trinity/p2p/asyncio_utils.py:31>>, <Task cancelled name='Application/eth1-head-info/apply/<Node(0xe948ba@34.77.191.47)>' coro=<wait_first() done, defined at /home/jcarver/code/trinity/p2p/asyn
cio_utils.py:31>>, <Task cancelled name='Application/eth/apply/<Node(0xe948ba@34.77.191.47)>' coro=<wait_first() done, defined at /home/jcarver/code/trinity/p2p/asyncio_utils.py:31>>})
Traceback (most recent call last):
  File "/home/jcarver/code/trinity/p2p/asyncio_utils.py", line 82, in cancel_pending_tasks
    yield
  File "/home/jcarver/code/trinity/p2p/asyncio_utils.py", line 54, in wait_first
    done, pending = await asyncio.wait(tasks, return_when=asyncio.FIRST_COMPLETED)
  File "/home/jcarver/.pyenv/versions/3.8.5/lib/python3.8/asyncio/tasks.py", line 426, in wait
    return await _wait(fs, timeout, return_when, loop)
  File "/home/jcarver/.pyenv/versions/3.8.5/lib/python3.8/asyncio/tasks.py", line 523, in _wait
    await waiter
asyncio.exceptions.CancelledError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/jcarver/code/async-service/async_service/base.py", line 300, in _run_and_manage_task
    await task.run()
  File "/home/jcarver/code/async-service/async_service/asyncio.py", line 36, in run
    await asyncio.create_task(
  File "/home/jcarver/code/trinity/p2p/connection.py", line 134, in run_behaviors
    await wait_first(futures, max_wait_after_cancellation=2)
  File "/home/jcarver/code/trinity/p2p/asyncio_utils.py", line 69, in wait_first
    raise done_task.exception()
  File "/home/jcarver/.pyenv/versions/3.8.5/lib/python3.8/contextlib.py", line 189, in __aexit__
    await self.gen.athrow(typ, value, traceback)
  File "/home/jcarver/code/trinity/p2p/asyncio_utils.py", line 113, in cancel_pending_tasks
    raise MultiError(errors)
asyncio.exceptions.TimeoutError: ('Tasks never returned after being cancelled: %s', {<Task cancelled name='Application/p2p/apply/<Node(0xe948ba@34.77.191.47)>' coro=<wait_first() done, defined at /home/jcarver/code/trinity/p2p/asyncio_utils.py:31>>, <Task cancelled name='Application/eth1-chain-info/apply/<Node(0xe948ba@34.77.191.47)>' coro=<wait_first() done, defined at /home/jcarver/code/trinity/p2p/asyncio_utils.py:31>>, <Task cancelled name='Application/eth1-head-info/apply/<Node(0xe948ba@34.77.191.47)>' coro=<wait_first() done, defined at /home/jcarver/code/trinity/p2p/asyncio_utils.py:31>>, <Task cancelled name='Application/eth/apply/<Node(0xe948ba@34.77.191.47)>' coro=<wait_first() done, defined at /home/jcarver/code/trinity/p2p/asyncio_utils.py:31>>})

How can it be fixed

🤷‍♂️ any ideas @gsalgado ? It doesn't make sense to me that all the listed coros are "done" or "cancelled", but they were in the pending set returned by asyncio.wait.

I think this workaround will at least reduce the frequency of the issue:

diff --git a/p2p/connection.py b/p2p/connection.py
index 738213e45..346072186 100644
--- a/p2p/connection.py
+++ b/p2p/connection.py
@@ -131,7 +131,7 @@ class Connection(ConnectionAPI, Service):
                 try:
                     for behavior in behaviors:
                         behavior.post_apply()
-                    await wait_first(futures, max_wait_after_cancellation=2)
+                    await wait_first(futures, max_wait_after_cancellation=10)
                 except PeerConnectionLost:
                     # Any of our behaviors may propagate a PeerConnectionLost, which is to be
                     # expected as many Connection APIs used by them can raise that. To avoid a

It doesn't make sense to me that all the listed coros are "done" or "cancelled", but they were in the pending set returned by asyncio.wait

The only explanation I can think of is that the asyncio.wait() task yielded because some tasks were still pending and then the next time it executed the timeout had already passed and it returned immediately, even though all tasks were actually done. Maybe this could happen if the event loop is overloaded. Can you check your logs to see if that could have been the case?

As for the timeout, 2s may be too low, and although 10s is probably too much I guess it won't hurt us in any way

Yeah, the delay isn't horrendous, but there are a lot of tasks pending:

    INFO  2020-08-26 23:00:15,734       Sync / PeerPool  Asyncio event loop blocked or overloaded: delay=0.797s, pending=8484, done=0, common_tasks=DISABLED
   DEBUG  2020-08-26 23:00:16,979            wait_first  Task {<Task finished name='AsyncioIsolatedComponent/Sync / PeerPool/do_run' coro=<SyncerComponent.do_run() done, defined at /home/jcarver/code/trinity/trinity/components/builtin/syncer/component.py:356> exception=TimeoutError('Tasks never returned after being cancelled: %s', {<Task cancelled name='Application/p2p/apply/<Node(0xe948ba@34.77.191.47)>' coro=<wait_first() done, defined at /home/jcarver/code/trinity/p2p/asyncio_utils.py:31>>, <Task cancelled name='Application/eth1-chain-info/apply/<Node(0xe948ba@34.77.191.47)>' coro=<wait_first() done, defined at /home/jcarver/code/trinity/p2p/asyncio_utils.py:31>>, <Task cancelled name='Application/eth1-head-info/apply/<Node(0xe948ba@34.77.191.47)>' coro=<wait_first() done, defined at /home/jcarver/code/trinity/p2p/asyncio_utils.py:31>>, <Task cancelled name='Application/eth/apply/<Node(0xe948ba@34.77.191.47)>' coro=<wait_first() done, defined at /home/jcarver/code/trinity/p2p/asyncio_utils.py:31>>})>} finished, cancelling pending ones: {<Task pending name='AsyncioIsolatedComponent/Sync / PeerPool/loop_monitoring_task' coro=<AsyncioIsolatedComponent._loop_monitoring_task() running at /home/jcarver/code/trinity/trinity/extensibility/asyncio.py:27> wait_for=<Future pending cb=[<TaskWakeupMethWrapper object at 0x7f970d4e8f70>()]>>, <Task pending name='AsyncioIsolatedComponent/Sync / PeerPool/eventbus/wait_finished' coro=<AsyncioManager.wait_finished() running at /home/jcarver/code/async-service/async_service/asyncio.py:286> wait_for=<Future pending cb=[<TaskWakeupMethWrapper object at 0x7f971a9260d0>()]>>}

There seems to be some kind of bug where the Beam Sync was caught up, but a bunch of (old) "preview" blocks were hanging, waiting for the data to show up. I believe that's why the task count was so high. I'm working on that separately.