ethereum/trinity

Trinity crashes because some components can't establish an event bus connection to another component


I'm seeing this quite frequently when running with metrics enabled. Most components establish their event bus connection to the metrics component without trouble, but often one of them fails to do so and brings Trinity down.

DEBUG  2020-11-18 16:06:35,203       EventBusService  EventBus Endpoint networking connecting to other Endpoints: bmetrics
[...]
DEBUG  2020-11-18 16:06:35,329       EventBusService  EventBus Endpoint discovery connecting to other Endpoints: bmetrics
[...]
DEBUG  2020-11-18 16:06:35,446        RemoteEndpoint  RemoteEndpoint connection established: networking <-> bmetrics
[...]
WARNING  2020-11-18 16:07:05,330       EventBusService  Failed to connect discovery to one of bmetrics:
<bound method TrioIsolatedComponent._do_run of <trinity.components.builtin.preferred_node.component.PreferredNodeComponent object at 0x7f16cf4d5310>> raised an unexpected exception
Traceback (most recent call last):
 File "/home/salgado/virtualenvs/trinity/lib/python3.8/site-packages/trio/_timeouts.py", line 105, in fail_at
   yield scope
 File "/home/salgado/virtualenvs/trinity/lib/python3.8/site-packages/lahja/trio/endpoint.py", line 677, in connect_to_endpoints
   await self.wait_until_connected_to(config.name)
 File "/home/salgado/virtualenvs/trinity/lib/python3.8/site-packages/lahja/base.py", line 645, in wait_until_connected_to
   await self._remote_connections_changed.wait()
 File "/home/salgado/virtualenvs/trinity/lib/python3.8/site-packages/trio/_sync.py", line 746, in wait
   await self._lot.park()
 File "/home/salgado/virtualenvs/trinity/lib/python3.8/site-packages/trio/_core/_parking_lot.py", line 136, in park
   await _core.wait_task_rescheduled(abort_fn)
 File "/home/salgado/virtualenvs/trinity/lib/python3.8/site-packages/trio/_core/_traps.py", line 166, in wait_task_rescheduled
   return (await _async_yield(WaitTaskRescheduled(abort_func))).unwrap()
 File "/home/salgado/virtualenvs/trinity/lib/python3.8/site-packages/outcome/_sync.py", line 111, in unwrap
   raise captured_error
 File "/home/salgado/virtualenvs/trinity/lib/python3.8/site-packages/trio/_core/_run.py", line 1096, in raise_cancel
   raise Cancelled._create()
trio.Cancelled: Cancelled

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
 File "/home/salgado/virtualenvs/trinity/lib/python3.8/site-packages/asyncio_run_in_process/_child.py", line 205, in run_process
   runner(async_fn, args, to_parent)
 File "/home/salgado/virtualenvs/trinity/lib/python3.8/site-packages/asyncio_run_in_process/_child_trio.py", line 63, in _run_on_trio
   result = trio.run(_do_async_fn, async_fn, args, to_parent)
 File "/home/salgado/virtualenvs/trinity/lib/python3.8/site-packages/trio/_core/_run.py", line 1896, in run
   raise runner.main_task_outcome.error
 File "/home/salgado/virtualenvs/trinity/lib/python3.8/site-packages/asyncio_run_in_process/_child_trio.py", line 55, in _do_async_fn
   result = await async_fn(*args)
 File "/home/salgado/src/snakecharmers/trinity/trinity/extensibility/trio.py", line 80, in _do_run
   nursery.cancel_scope.cancel()
 File "/home/salgado/virtualenvs/trinity/lib/python3.8/site-packages/async_generator/_util.py", line 53, in __aexit__
   await self._agen.athrow(type, value, traceback)
 File "/home/salgado/virtualenvs/trinity/lib/python3.8/site-packages/async_service/trio.py", line 411, in background_trio_service
   await manager.stop()
 File "/home/salgado/virtualenvs/trinity/lib/python3.8/site-packages/trio/_core/_run.py", line 741, in __aexit__
   raise combined_error_from_nursery
 File "/home/salgado/virtualenvs/trinity/lib/python3.8/site-packages/async_service/trio.py", line 205, in run
   raise trio.MultiError(
 File "/home/salgado/virtualenvs/trinity/lib/python3.8/site-packages/async_service/base.py", line 324, in _run_and_manage_task
   await task.run()
 File "/home/salgado/virtualenvs/trinity/lib/python3.8/site-packages/async_service/trio.py", line 76, in run
   await self._async_fn(*self._async_fn_args)
 File "/home/salgado/src/snakecharmers/trinity/trinity/extensibility/event_bus.py", line 103, in _auto_connect_new_announced_endpoints
   await endpoint.connect_to_endpoints(*endpoints_to_connect_to)
 File "/home/salgado/virtualenvs/trinity/lib/python3.8/site-packages/lahja/trio/endpoint.py", line 677, in connect_to_endpoints
   await self.wait_until_connected_to(config.name)
 File "/usr/lib/python3.8/contextlib.py", line 131, in __exit__
   self.gen.throw(type, value, traceback)
 File "/home/salgado/virtualenvs/trinity/lib/python3.8/site-packages/trio/_timeouts.py", line 107, in fail_at
   raise TooSlowError
trio.TooSlowError

It seems to me that trio-based components hang on the connection attempt until it times out (as above), whereas asyncio-based components fail immediately with a "Resource temporarily unavailable" error (Errno 11):

   DEBUG  2020-11-18 08:59:14,048       EventBusService  EventBus Endpoint bbeam-sync-chain-preview-3 connecting to other Endpoints: bmetrics
   DEBUG  2020-11-18 08:59:14,043       EventBusService  EventBus Endpoint bbeam-sync-chain-execution connecting to other Endpoints: bmetrics
   WARNING  2020-11-18 08:59:14,049       EventBusService  Failed to connect bbeam-sync-chain-preview-3 to one of bmetrics: [Errno 11] Resource temporarily unavailable
   WARNING  2020-11-18 08:59:14,044       EventBusService  Failed to connect bbeam-sync-chain-execution to one of bmetrics: [Errno 11] Resource temporarily unavailable

I've never seen that with any components other than the metrics one, btw.
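
For what it's worth, Errno 11 is EAGAIN on Linux, which is normally a transient condition, so one possible mitigation is to retry the connection attempt a few times with a short backoff instead of giving up on the first failure. The snippet below is only a rough sketch of that idea, not what Trinity or lahja actually do. `retry_on_eagain` and `make_flaky_connect` are hypothetical names made up for illustration, and the fake `connect` merely simulates an endpoint that is briefly unavailable.

```python
import asyncio
import errno


async def retry_on_eagain(connect, attempts=5, initial_delay=0.01):
    """Await connect(), retrying with exponential backoff while it raises EAGAIN.

    `connect` is any zero-argument coroutine function (hypothetical here).
    Errors other than EAGAIN, or EAGAIN on the last attempt, are re-raised.
    """
    delay = initial_delay
    for attempt in range(1, attempts + 1):
        try:
            return await connect()
        except OSError as exc:
            if exc.errno != errno.EAGAIN or attempt == attempts:
                raise
            await asyncio.sleep(delay)
            delay *= 2  # back off a little more before each new attempt


def make_flaky_connect(failures):
    """Hypothetical stand-in for a real connection attempt.

    The returned coroutine function raises EAGAIN `failures` times, then succeeds.
    """
    state = {"calls": 0}

    async def connect():
        state["calls"] += 1
        if state["calls"] <= failures:
            raise BlockingIOError(errno.EAGAIN, "Resource temporarily unavailable")
        return f"connected after {state['calls']} attempts"

    return connect


if __name__ == "__main__":
    print(asyncio.run(retry_on_eagain(make_flaky_connect(failures=2))))
```

A retry like this would only paper over the symptom. It would not explain why the target endpoint is unavailable in the first place, nor would it help the trio-based components, which hang until the timeout rather than failing fast.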

This is not specific to the metrics component, and is actually causing random test failures: https://app.circleci.com/pipelines/github/ethereum/trinity/7828/workflows/8e909c55-84a7-4bfe-9fe7-3f668c90b824/jobs/294983

   DEBUG  12-01 08:14:33  async_process_runner.py  b'\x1b[1m\x1b[33m WARNING  2020-12-01 08:14:33,150       EventBusService  Failed to connect bnewblockcomponent to one of discovery: \x1b[0m\n'
   DEBUG  12-01 08:14:33  async_process_runner.py  b'/home/circleci/repo/.tox/py37-long_run_integration/lib/python3.7/site-packages/cytoolz/compatibility.py:6: DeprecationWarning: The toolz.compatibility module is no longer needed in Python 3 and has been deprecated. Please import these utilities directly from the standard library. This module will be removed in a future release.\n'
   DEBUG  12-01 08:14:33  async_process_runner.py  b'  category=DeprecationWarning)\n'
   DEBUG  12-01 08:14:33  async_process_runner.py  b'<Manager[TrioEventBusService] flags=SRcfe>: task _auto_connect_new_announced_endpoints[daemon=True] exited with error: \n'