Trinity crashes because some components can't establish an event bus connection to another component
Closed · 1 comment
gsalgado commented
I'm seeing this quite frequently when running with metrics enabled. Most components establish an event bus connection to the metrics component without trouble, but often one of them fails to do so and brings Trinity down.
DEBUG 2020-11-18 16:06:35,203 EventBusService EventBus Endpoint networking connecting to other Endpoints: bmetrics
[...]
DEBUG 2020-11-18 16:06:35,329 EventBusService EventBus Endpoint discovery connecting to other Endpoints: bmetrics
[...]
DEBUG 2020-11-18 16:06:35,446 RemoteEndpoint RemoteEndpoint connection established: networking <-> bmetrics
[...]
WARNING 2020-11-18 16:07:05,330 EventBusService Failed to connect discovery to one of bmetrics:
<bound method TrioIsolatedComponent._do_run of <trinity.components.builtin.preferred_node.component.PreferredNodeComponent object at 0x7f16cf4d5310>> raised an unexpected exception
Traceback (most recent call last):
File "/home/salgado/virtualenvs/trinity/lib/python3.8/site-packages/trio/_timeouts.py", line 105, in fail_at
yield scope
File "/home/salgado/virtualenvs/trinity/lib/python3.8/site-packages/lahja/trio/endpoint.py", line 677, in connect_to_endpoints
await self.wait_until_connected_to(config.name)
File "/home/salgado/virtualenvs/trinity/lib/python3.8/site-packages/lahja/base.py", line 645, in wait_until_connected_to
await self._remote_connections_changed.wait()
File "/home/salgado/virtualenvs/trinity/lib/python3.8/site-packages/trio/_sync.py", line 746, in wait
await self._lot.park()
File "/home/salgado/virtualenvs/trinity/lib/python3.8/site-packages/trio/_core/_parking_lot.py", line 136, in park
await _core.wait_task_rescheduled(abort_fn)
File "/home/salgado/virtualenvs/trinity/lib/python3.8/site-packages/trio/_core/_traps.py", line 166, in wait_task_rescheduled
return (await _async_yield(WaitTaskRescheduled(abort_func))).unwrap()
File "/home/salgado/virtualenvs/trinity/lib/python3.8/site-packages/outcome/_sync.py", line 111, in unwrap
raise captured_error
File "/home/salgado/virtualenvs/trinity/lib/python3.8/site-packages/trio/_core/_run.py", line 1096, in raise_cancel
raise Cancelled._create()
trio.Cancelled: Cancelled
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/salgado/virtualenvs/trinity/lib/python3.8/site-packages/asyncio_run_in_process/_child.py", line 205, in run_process
runner(async_fn, args, to_parent)
File "/home/salgado/virtualenvs/trinity/lib/python3.8/site-packages/asyncio_run_in_process/_child_trio.py", line 63, in _run_on_trio
result = trio.run(_do_async_fn, async_fn, args, to_parent)
File "/home/salgado/virtualenvs/trinity/lib/python3.8/site-packages/trio/_core/_run.py", line 1896, in run
raise runner.main_task_outcome.error
File "/home/salgado/virtualenvs/trinity/lib/python3.8/site-packages/asyncio_run_in_process/_child_trio.py", line 55, in _do_async_fn
result = await async_fn(*args)
File "/home/salgado/src/snakecharmers/trinity/trinity/extensibility/trio.py", line 80, in _do_run
nursery.cancel_scope.cancel()
File "/home/salgado/virtualenvs/trinity/lib/python3.8/site-packages/async_generator/_util.py", line 53, in __aexit__
await self._agen.athrow(type, value, traceback)
File "/home/salgado/virtualenvs/trinity/lib/python3.8/site-packages/async_service/trio.py", line 411, in background_trio_service
await manager.stop()
File "/home/salgado/virtualenvs/trinity/lib/python3.8/site-packages/trio/_core/_run.py", line 741, in __aexit__
raise combined_error_from_nursery
File "/home/salgado/virtualenvs/trinity/lib/python3.8/site-packages/async_service/trio.py", line 205, in run
raise trio.MultiError(
File "/home/salgado/virtualenvs/trinity/lib/python3.8/site-packages/async_service/base.py", line 324, in _run_and_manage_task
await task.run()
File "/home/salgado/virtualenvs/trinity/lib/python3.8/site-packages/async_service/trio.py", line 76, in run
await self._async_fn(*self._async_fn_args)
File "/home/salgado/src/snakecharmers/trinity/trinity/extensibility/event_bus.py", line 103, in _auto_connect_new_announced_endpoints
await endpoint.connect_to_endpoints(*endpoints_to_connect_to)
File "/home/salgado/virtualenvs/trinity/lib/python3.8/site-packages/lahja/trio/endpoint.py", line 677, in connect_to_endpoints
await self.wait_until_connected_to(config.name)
File "/usr/lib/python3.8/contextlib.py", line 131, in __exit__
self.gen.throw(type, value, traceback)
File "/home/salgado/virtualenvs/trinity/lib/python3.8/site-packages/trio/_timeouts.py", line 107, in fail_at
raise TooSlowError
trio.TooSlowError
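The traceback above boils down to a wait on a "connections changed" notification that gets cancelled once a deadline expires (the log timestamps show roughly 30 seconds between the connection attempt and the warning). A minimal sketch of that pattern follows, using stdlib asyncio rather than trio so it runs without extra dependencies. `wait_until_connected` and the short timeout are hypothetical stand-ins, not lahja's actual implementation.

```python
import asyncio


async def wait_until_connected(connected: asyncio.Event, timeout: float) -> None:
    # Hypothetical stand-in for a "wait until the peer is connected" call:
    # block until the event is set, but give up once the deadline passes.
    await asyncio.wait_for(connected.wait(), timeout)


async def demo() -> str:
    connected = asyncio.Event()  # never set: the peer never shows up
    try:
        await wait_until_connected(connected, timeout=0.05)
    except asyncio.TimeoutError:
        # asyncio's counterpart of the trio.TooSlowError seen in the traceback
        return "timed out"
    return "connected"


result = asyncio.run(demo())
print(result)
```

If the notification never fires, the only way out of the wait is the deadline, which matches the hang-then-fail behaviour in the log.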
ISTM that for trio-based components the connection attempt hangs until it times out (as above), whereas asyncio-based components fail immediately with a "resource temporarily unavailable" error:
DEBUG 2020-11-18 08:59:14,048 EventBusService EventBus Endpoint bbeam-sync-chain-preview-3 connecting to other Endpoints: bmetrics
DEBUG 2020-11-18 08:59:14,043 EventBusService EventBus Endpoint bbeam-sync-chain-execution connecting to other Endpoints: bmetrics
WARNING 2020-11-18 08:59:14,049 EventBusService Failed to connect bbeam-sync-chain-preview-3 to one of bmetrics: [Errno 11] Resource temporarily unavailable
WARNING 2020-11-18 08:59:14,044 EventBusService Failed to connect bbeam-sync-chain-execution to one of bmetrics: [Errno 11] Resource temporarily unavailable
I've never seen that with any components other than the metrics one, btw.
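For illustration only, here is one way a caller could tolerate a transient error like the `[Errno 11]` warnings above: retry the attempt a few times before giving up. This is a sketch under assumptions, not Trinity's or lahja's code and not necessarily the right fix for this issue. `connect_with_retry`, `flaky_connect`, the attempt count, and the delays are all hypothetical.

```python
import asyncio
import errno


async def connect_with_retry(connect, attempts=5, delay=0.01):
    # `connect` is a hypothetical zero-argument coroutine function that raises
    # OSError on failure; it does not correspond to a real lahja call.
    for attempt in range(1, attempts + 1):
        try:
            return await connect()
        except OSError as exc:
            # Retry only the transient error (EAGAIN, which is 11 on Linux);
            # re-raise anything else, or the last failure once attempts run out.
            if exc.errno != errno.EAGAIN or attempt == attempts:
                raise
            await asyncio.sleep(delay * attempt)  # simple linear backoff


calls = []


async def flaky_connect():
    # Simulated endpoint: fails twice with EAGAIN, then succeeds.
    calls.append(1)
    if len(calls) < 3:
        raise BlockingIOError(errno.EAGAIN, "Resource temporarily unavailable")
    return "connected"


outcome = asyncio.run(connect_with_retry(flaky_connect))
print(outcome, len(calls))
```

A retry like this only hides the symptom. It does not explain why the target endpoint was unavailable in the first place.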
gsalgado commented
This is not specific to the metrics component, and is actually causing random test failures: https://app.circleci.com/pipelines/github/ethereum/trinity/7828/workflows/8e909c55-84a7-4bfe-9fe7-3f668c90b824/jobs/294983
DEBUG 12-01 08:14:33 async_process_runner.py b'\x1b[1m\x1b[33m WARNING 2020-12-01 08:14:33,150 EventBusService Failed to connect bnewblockcomponent to one of discovery: \x1b[0m\n'
DEBUG 12-01 08:14:33 async_process_runner.py b'/home/circleci/repo/.tox/py37-long_run_integration/lib/python3.7/site-packages/cytoolz/compatibility.py:6: DeprecationWarning: The toolz.compatibility module is no longer needed in Python 3 and has been deprecated. Please import these utilities directly from the standard library. This module will be removed in a future release.\n'
DEBUG 12-01 08:14:33 async_process_runner.py b' category=DeprecationWarning)\n'
DEBUG 12-01 08:14:33 async_process_runner.py b'<Manager[TrioEventBusService] flags=SRcfe>: task _auto_connect_new_announced_endpoints[daemon=True] exited with error: \n'