_preview_address_load() never completing
carver opened this issue · 3 comments
What is wrong?
There's a leak in the number of active coroutines. After applying this patch to async-service: ethereum/async-service#77, it raised:
File "/home/jcarver/code/trinity/trinity/sync/full/chain.py", line 1109, in _preview_ready_blocks
await self._block_importer.preview_transactions(
File "/home/jcarver/code/trinity/trinity/sync/beam/chain.py", line 802, in preview_transactions
self.manager.run_task(self._preview_address_load, header, parent_state_root, transactions)
File "/home/jcarver/code/async-service/async_service/asyncio.py", line 337, in run_task
self._common_run_task(task)
File "/home/jcarver/code/async-service/async_service/base.py", line 289, in _common_run_task
self._add_child_task(task.parent, task)
File "/home/jcarver/code/async-service/async_service/base.py", line 304, in _add_child_task
raise TooManyChildrenException(
async_service.exceptions.TooManyChildrenException: Tried to add more than 1000 child tasks. Most common tasks: [('_preview_address_load[daemon=False]', 1000), ('run[daemon=False]', 1)]
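To make the failure mode concrete, here is a minimal, self-contained analogy in plain asyncio (not async-service or trinity code): a new preview task is spawned per block, the tasks never finish, and the set of live children grows without bound. async-service caps children at 1000 per parent and raises TooManyChildrenException instead of leaking silently, per the traceback above.

```python
import asyncio

async def preview_address_load(block_number: int) -> None:
    # Stand-in for trinity's _preview_address_load(): pretend it waits
    # forever for data that never arrives.
    await asyncio.Event().wait()

async def main() -> None:
    children = set()
    for block_number in range(1_001):
        # Analogous to manager.run_task(self._preview_address_load, ...)
        children.add(asyncio.create_task(preview_address_load(block_number)))
        await asyncio.sleep(0)  # let the loop schedule the new task
        if len(children) > 1_000:
            print(f"{len(children)} live preview tasks and counting -- "
                  "async-service surfaces this as TooManyChildrenException")
            break
    for task in children:  # clean up the demo tasks
        task.cancel()

asyncio.run(main())
```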
How can it be fixed?
Some options:
- Patch the problem by putting a timeout on _preview_address_load() (see the sketch after this list)
- Figure out why it's not finishing. Maybe... Wrong state root? Predictive hang that is never woken up by urgent data? Something else?
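For the first option, a minimal sketch of what a timeout around the preview could look like. This is not verified against trinity's actual wiring; the wrapper name, the module-level logger, and PREVIEW_TIMEOUT are all hypothetical.

```python
import asyncio
import logging

logger = logging.getLogger("beam.preview")

PREVIEW_TIMEOUT = 120  # seconds; hypothetical value that would need tuning

async def preview_with_timeout(preview_coro, header) -> None:
    # Cancel a stuck preview instead of letting it linger forever as a
    # child task on the service.
    try:
        await asyncio.wait_for(preview_coro, timeout=PREVIEW_TIMEOUT)
    except asyncio.TimeoutError:
        logger.debug("Preview for %s timed out after %ds", header, PREVIEW_TIMEOUT)
```

run_task() would then be pointed at a wrapper like this instead of at _preview_address_load directly. It only treats the symptom, though, which is why the second option matters.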
Hm, well something seems to be causing a hang of some kind. Basically, _preview_address_load never completes until the block it is previewing is under active review. I added a log at the end of every preview, and the logs look like this:
INFO 2020-09-17 16:19:01,009 BaseBodyChainSyncer Imported block 10882660 (112 txs) in 12.22 seconds, lagging 6 blocks | 1m22s
DEBUG 2020-09-17 16:19:03,356 BeamBlockImporter Preview accounts end for <BlockHeader #10882661 8d93389a>, collected 73 nodes in 65.8s
INFO 2020-09-17 16:19:12,006 BaseBodyChainSyncer Imported block 10882661 (167 txs) in 11.00 seconds, lagging 5 blocks | 1m24s
DEBUG 2020-09-17 16:19:13,670 BeamBlockImporter Preview accounts end for <BlockHeader #10882662 a2177b82>, collected 1 nodes in 65.5s
INFO 2020-09-17 16:19:24,694 BaseBodyChainSyncer Imported block 10882662 (148 txs) in 12.69 seconds, lagging 4 blocks | 1m28s
DEBUG 2020-09-17 16:19:25,955 BeamBlockImporter Preview accounts end for <BlockHeader #10882663 0d05c77b>, collected 3 nodes in 73.0s
INFO 2020-09-17 16:19:35,561 BaseBodyChainSyncer Imported block 10882663 (180 txs) in 10.87 seconds, lagging 3 blocks | 1m28s
DEBUG 2020-09-17 16:19:37,522 BeamBlockImporter Preview accounts end for <BlockHeader #10882664 707dc755>, collected 75 nodes in 69.0s
INFO 2020-09-17 16:19:48,685 BaseBodyChainSyncer Imported block 10882664 (178 txs) in 13.12 seconds, lagging 8 blocks | 1m37s
DEBUG 2020-09-17 16:19:50,441 BeamBlockImporter Preview accounts end for <BlockHeader #10882665 9330f3aa>, collected 1 nodes in 79.0s
INFO 2020-09-17 16:19:59,985 BaseBodyChainSyncer Imported block 10882665 (168 txs) in 11.30 seconds, lagging 8 blocks | 1m32s
DEBUG 2020-09-17 16:20:02,158 BeamBlockImporter Preview accounts end for <BlockHeader #10882666 703a20d7>, collected 1 nodes in 65.8s
By chance alone, at least some previews should end before the block is being actively imported, but this pattern held for 100 consecutive blocks.
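For reference, the "Preview accounts end" DEBUG lines above could be produced by instrumentation roughly along these lines (a hypothetical sketch; the node-count return value and module-level logger are assumptions, not trinity's actual code):

```python
import logging
import time

logger = logging.getLogger("beam.preview")

async def timed_preview(preview_coro, header) -> None:
    # Time the preview and report how many nodes it collected, in the
    # shape of the "Preview accounts end ..." log lines above.
    start = time.monotonic()
    collected_nodes = await preview_coro  # assumed to return a node count
    logger.debug(
        "Preview accounts end for %s, collected %d nodes in %.1fs",
        header, collected_nodes, time.monotonic() - start,
    )
```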
I noticed that it (indirectly) calls _request_address_nodes(), which uses loop.run_in_executor() with the default executor. So I guess that if more than max_workers coroutines (the default varies between Python versions) are blocked on that run_in_executor() call, new ones can't reach that point, and you end up in a deadlock because you can't get the data required to unblock the coroutines holding the executor's worker threads?

This could also be the reason why all the beam preview components fail to shut down properly...
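That suspected starvation can be reproduced in isolation (a self-contained illustration, not trinity code): fill every worker in a small pool with a job that blocks waiting for data, then submit the job that would deliver that data.

```python
import asyncio
import concurrent.futures
import threading

def wait_for_data(ready: threading.Event) -> None:
    ready.wait()  # a "preview" job blocking a worker thread until data arrives

def deliver_data(ready: threading.Event) -> None:
    ready.set()   # the job that would unblock the waiters

async def main() -> None:
    loop = asyncio.get_running_loop()
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=2)
    ready = threading.Event()
    # Every worker thread is now blocked inside wait_for_data()...
    waiters = [loop.run_in_executor(executor, wait_for_data, ready) for _ in range(2)]
    # ...so deliver_data() sits in the executor's queue and never runs.
    unblocker = loop.run_in_executor(executor, deliver_data, ready)
    try:
        await asyncio.wait_for(asyncio.gather(*waiters, unblocker), timeout=2)
    except asyncio.TimeoutError:
        print("deadlocked: all workers blocked, the unblocking job never ran")
    ready.set()  # release the workers so the demo process can exit cleanly

asyncio.run(main())
```

The real default executor has a larger pool (min(32, os.cpu_count() + 4) workers on Python 3.8+, cpu_count() * 5 before that), but the mechanism would be the same once enough previews block at once.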
Yup, a very reasonable question to ask. I did put a thread_Q readout in the logs, and it suggests that we aren't backed up on threads.
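One way to get a readout like that without poking at executor internals is to count in-flight run_in_executor() jobs at the call site (a hypothetical helper; trinity's actual thread_Q metric may be computed differently):

```python
import asyncio

# Hypothetical counter of in-flight default-executor jobs. All updates
# happen on the event loop thread, so a plain int is safe here.
_inflight_executor_jobs = 0

async def run_in_default_executor(func, *args):
    global _inflight_executor_jobs
    loop = asyncio.get_running_loop()
    _inflight_executor_jobs += 1
    try:
        return await loop.run_in_executor(None, func, *args)
    finally:
        _inflight_executor_jobs -= 1

def thread_queue_readout() -> str:
    # Logged periodically; if this stays well below max_workers, the
    # thread pool is not the bottleneck.
    return f"thread_Q: {_inflight_executor_jobs} executor jobs in flight"
```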
Early indications are that it has to do with the limit on how many queued-up predictive nodes are allowed. Further, queued nodes aren't cleared when the node is found through an urgent node download (or through local calculation of new state at the end of a block import).
I added some code that checks a bit more aggressively whether predictive nodes have already been found, to try to clear out old predictive tasks. That seems to be limiting the number of active _preview_address_load coros. It doesn't really answer the question of why address previews aren't completing a little earlier, but that can be a separate issue to hunt down.
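A sketch of the "check more aggressively" idea (predictive_queue and db are illustrative names, not the actual patch): before servicing queued predictive requests, drop any whose node has already arrived via an urgent download or been computed at the end of a block import.

```python
from typing import Set

# Hypothetical structures: predictive_queue holds node hashes still wanted
# by preview coroutines; db is any mapping-like store of trie nodes that
# urgent downloads and block imports write into.

def prune_satisfied_predictive_tasks(predictive_queue: Set[bytes], db) -> int:
    """Drop queued predictive node hashes that are already present locally."""
    already_present = {node_hash for node_hash in predictive_queue if node_hash in db}
    predictive_queue -= already_present
    return len(already_present)
```

Pruning along these lines would keep preview coroutines from queueing behind requests for nodes that urgent sync already fetched, which is consistent with the observed drop in active _preview_address_load coros.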