ethereum/trinity

_preview_address_load() never completing

carver opened this issue · 3 comments

What is wrong?

There's a leak in the number of active coroutines. After applying this async-service patch, ethereum/async-service#77, it raised:

  File "/home/jcarver/code/trinity/trinity/sync/full/chain.py", line 1109, in _preview_ready_blocks
    await self._block_importer.preview_transactions(
  File "/home/jcarver/code/trinity/trinity/sync/beam/chain.py", line 802, in preview_transactions
    self.manager.run_task(self._preview_address_load, header, parent_state_root, transactions)
  File "/home/jcarver/code/async-service/async_service/asyncio.py", line 337, in run_task
    self._common_run_task(task)
  File "/home/jcarver/code/async-service/async_service/base.py", line 289, in _common_run_task
    self._add_child_task(task.parent, task)
  File "/home/jcarver/code/async-service/async_service/base.py", line 304, in _add_child_task
    raise TooManyChildrenException(
async_service.exceptions.TooManyChildrenException: Tried to add more than 1000 child tasks. Most common tasks: [('_preview_address_load[daemon=False]', 1000), ('run[daemon=False]', 1)]
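
(For anyone looking at this without that patch: a rough way to watch for this kind of leak is a counter around the coroutine in question. This is just an illustration, not the async-service change; the decorator and names are made up.)

    from collections import Counter
    from functools import wraps

    # Hypothetical diagnostic, not Trinity/async-service code: count how many
    # instances of a coroutine are in flight, to spot something like the
    # 1000 x _preview_address_load above.
    ACTIVE = Counter()

    def track_active(coro_fn):
        @wraps(coro_fn)
        async def wrapper(*args, **kwargs):
            ACTIVE[coro_fn.__qualname__] += 1
            try:
                return await coro_fn(*args, **kwargs)
            finally:
                ACTIVE[coro_fn.__qualname__] -= 1
        return wrapper

    # e.g. decorate _preview_address_load and log ACTIVE periodically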

How can it be fixed?

Some options:

  • Patch over the problem by putting a timeout on _preview_address_load() (rough sketch after this list)
  • Figure out why it's not finishing. Maybe a wrong state root? A predictive hang that is never woken up by urgent data? Something else?
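
For the timeout option, a minimal sketch of what that could look like, wrapping the existing coroutine with asyncio.wait_for (the 60s value, logger, and helper name are placeholders, not measured or existing code):

    import asyncio
    import logging

    logger = logging.getLogger('trinity.sync.beam')

    # Placeholder value; the right timeout would need tuning against real sync timings.
    PREVIEW_TIMEOUT = 60

    async def preview_with_timeout(preview_coro_fn, header, parent_state_root, transactions):
        # Wrap _preview_address_load so a stuck preview gets cancelled instead of
        # accumulating as a child task until TooManyChildrenException fires.
        try:
            await asyncio.wait_for(
                preview_coro_fn(header, parent_state_root, transactions),
                timeout=PREVIEW_TIMEOUT,
            )
        except asyncio.TimeoutError:
            logger.debug("Address preview for %s timed out after %ss", header, PREVIEW_TIMEOUT)

The call site in preview_transactions would then presumably pass preview_with_timeout plus self._preview_address_load and the same arguments to run_task, instead of the bare coroutine.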

Hm, well, something seems to be causing a hang of some kind. Basically, _preview_address_load never completes until the block that it is previewing is actively being imported. I added a log at the end of every preview, and the logs look like this:

    INFO  2020-09-17 16:19:01,009   BaseBodyChainSyncer  Imported block 10882660 (112 txs) in 12.22 seconds, lagging 6 blocks | 1m22s
   DEBUG  2020-09-17 16:19:03,356     BeamBlockImporter  Preview accounts end for <BlockHeader #10882661 8d93389a>, collected 73 nodes in 65.8s
    INFO  2020-09-17 16:19:12,006   BaseBodyChainSyncer  Imported block 10882661 (167 txs) in 11.00 seconds, lagging 5 blocks | 1m24s
   DEBUG  2020-09-17 16:19:13,670     BeamBlockImporter  Preview accounts end for <BlockHeader #10882662 a2177b82>, collected 1 nodes in 65.5s
    INFO  2020-09-17 16:19:24,694   BaseBodyChainSyncer  Imported block 10882662 (148 txs) in 12.69 seconds, lagging 4 blocks | 1m28s
   DEBUG  2020-09-17 16:19:25,955     BeamBlockImporter  Preview accounts end for <BlockHeader #10882663 0d05c77b>, collected 3 nodes in 73.0s
    INFO  2020-09-17 16:19:35,561   BaseBodyChainSyncer  Imported block 10882663 (180 txs) in 10.87 seconds, lagging 3 blocks | 1m28s
   DEBUG  2020-09-17 16:19:37,522     BeamBlockImporter  Preview accounts end for <BlockHeader #10882664 707dc755>, collected 75 nodes in 69.0s
    INFO  2020-09-17 16:19:48,685   BaseBodyChainSyncer  Imported block 10882664 (178 txs) in 13.12 seconds, lagging 8 blocks | 1m37s
   DEBUG  2020-09-17 16:19:50,441     BeamBlockImporter  Preview accounts end for <BlockHeader #10882665 9330f3aa>, collected 1 nodes in 79.0s
    INFO  2020-09-17 16:19:59,985   BaseBodyChainSyncer  Imported block 10882665 (168 txs) in 11.30 seconds, lagging 8 blocks | 1m32s
   DEBUG  2020-09-17 16:20:02,158     BeamBlockImporter  Preview accounts end for <BlockHeader #10882666 703a20d7>, collected 1 nodes in 65.8s

If it were just chance, the previews should finish before the block is actively imported at least some of the time, but this pattern held for 100 consecutive blocks.

I noticed that it (indirectly) calls _request_address_nodes(), which uses loop.run_in_executor() with the default executor. So I guess that if you have more than max_workers coroutines (the default varies between Python versions) blocked on that run_in_executor() call, new ones won't be able to reach that point, and you end up in a deadlock because you can't get the data required to unblock the coroutines holding the executor's worker threads?
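
That theory is easy to demonstrate in isolation. The following toy script (nothing Trinity-specific; all names made up) saturates a 2-worker executor with blocking waits, and the job that would release them just sits in the queue:

    import asyncio
    import threading
    from concurrent.futures import ThreadPoolExecutor

    data_ready = threading.Event()

    def blocking_wait():
        data_ready.wait()      # stands in for the blocking node lookup
        return "node data"

    def unblock():
        data_ready.set()       # stands in for the urgent download delivering the data

    async def main():
        loop = asyncio.get_running_loop()
        executor = ThreadPoolExecutor(max_workers=2)  # tiny pool to make the hang obvious

        # Occupy every worker thread with a coroutine blocked in run_in_executor()...
        waiters = [loop.run_in_executor(executor, blocking_wait) for _ in range(2)]
        await asyncio.sleep(0.1)
        # ...then the work that would unblock them just queues up behind them.
        unblocker = loop.run_in_executor(executor, unblock)
        try:
            await asyncio.wait_for(asyncio.gather(unblocker, *waiters), timeout=2)
        except asyncio.TimeoutError:
            print("deadlocked: no worker thread free to run unblock()")
        finally:
            data_ready.set()           # release the threads so the script exits cleanly
            executor.shutdown(wait=True)

    asyncio.run(main())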

This could also be the reason why all the beam preview components fail to shut down properly...

Yup, a very reasonable question to ask. I did put a thread_Q readout in the logs, and it suggests that we aren't backed up on threads.
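
(For reference, that readout is just the executor's backlog; something along these lines, which pokes at a private ThreadPoolExecutor attribute, so treat it as a throwaway diagnostic rather than the actual Trinity code:)

    from concurrent.futures import ThreadPoolExecutor

    def thread_queue_depth(executor: ThreadPoolExecutor) -> int:
        # _work_queue is private, but its size is how many submitted jobs are
        # still waiting for a free worker thread; ~0 means we are not thread-bound.
        return executor._work_queue.qsize()

    # e.g. logged alongside the preview messages:
    #   logger.debug("thread_Q: %d", thread_queue_depth(executor))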

Some early indication is that it has to do with the limit on how many queued up predictive nodes are allowed. Further, the queued nodes aren't cleared when the node is found through an urgent node download (or through local calculation of new state at the end of a block import).

I added some code that checks a bit more aggressively whether predictive nodes have already been found, to try to clear out old predictive tasks. That seems to be limiting the number of active _preview_address_load coros. It doesn't really answer the question of why address previews aren't completing a little earlier, but that can be a separate issue to hunt down.
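
Roughly the shape of that mitigation, with made-up names and none of the actual beam sync plumbing, assuming a bounded queue of predictive node requests that can be invalidated when the node arrives some other way:

    import asyncio
    from typing import Awaitable, Callable, Set

    class PredictiveNodeQueue:
        """Hypothetical sketch: cap pending predictive requests and drop any
        whose node has already arrived via an urgent download or a local
        state calculation, instead of letting them block forever."""

        def __init__(self, max_pending: int = 1000) -> None:
            self._already_stored: Set[bytes] = set()
            self._slots = asyncio.Semaphore(max_pending)

        def mark_stored(self, node_hash: bytes) -> None:
            # Called when the node shows up through another path.
            self._already_stored.add(node_hash)

        async def request(self, node_hash: bytes,
                          fetch: Callable[[bytes], Awaitable[None]]) -> None:
            if node_hash in self._already_stored:
                return                      # already have it: skip the preview load
            async with self._slots:
                if node_hash in self._already_stored:
                    return                  # re-check after waiting for a free slot
                await fetch(node_hash)

In this sketch, the re-check after acquiring a slot is the "more aggressive" part: a task that queued up before the node arrived through another path gets dropped instead of staying active.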