nucypher/nucypher

Event scanning seems to leave node unreachable after node can't reach other node(s) in DKG cohort


This issue arose during a ritual initiation on lynx. One of the supposedly "active" staking providers (0x24dbb0BEE134C3773D2C1791d65d99e307Fe86CF) no longer has a corresponding running node, but it gets sampled anyway because TACoChildApplication considers it active.

Each node in the cohort tries to reach all other nodes in the cohort when performing round 1 of the protocol. However, since one node in the cohort is unreachable, the other nodes seem to raise an exception during perform_round_1 when trying to learn about the non-running node (i.e. in block_until_specific_nodes_are_known()). The event scanning task then crashes and tries to restart itself. The node again tries to learn about the non-running node, and the cycle repeats consistently.

It seems the event scanner task just repeatedly crashes and restarts. This occurs because scan_chunk throws an exception when nodes in the cohort can't be contacted.

This cycle seems to render the node unreachable, i.e. the status page for nodes caught in this loop can't be reached, and Porter can't ping the node either.

(0x890069) Scanning events in block range 44811183 - 44812107
performing round 1 of DKG ritual #3 from blocktime 1705343058 with authority 0x3B42d26E19FF860bC4dEbB920DD8caA53F93c600.
Error during event hook: After 60 seconds and 0 rounds, didn't find these 1 nodes: {'0x24dbb0BEE134C3773D2C1791d65d99e307Fe86CF'}
Error during ritual event scanning: Traceback (most recent call last):
  File "/usr/local/lib/python3.12/site-packages/nucypher/blockchain/eth/trackers/dkg.py", line 66, in handle_errors
    self.start(now=True)
  File "/usr/local/lib/python3.12/site-packages/nucypher/utilities/task.py", line 28, in start
    d = self._task.start(interval=self.INTERVAL, now=now)
  File "/usr/local/lib/python3.12/site-packages/twisted/internet/task.py", line 206, in start
    self()
  File "/usr/local/lib/python3.12/site-packages/twisted/internet/task.py", line 251, in __call__
    d = maybeDeferred(self.f, *self.a, **self.kw)
--- <exception caught here> ---
  File "/usr/local/lib/python3.12/site-packages/twisted/internet/defer.py", line 209, in maybeDeferred
    result = f(*args, **kwargs)
  File "/usr/local/lib/python3.12/site-packages/nucypher/blockchain/eth/trackers/dkg.py", line 60, in run
    self.scanner()
  File "/usr/local/lib/python3.12/site-packages/nucypher/blockchain/eth/trackers/dkg.py", line 431, in scan
    self.__scan(
  File "/usr/local/lib/python3.12/site-packages/nucypher/blockchain/eth/trackers/dkg.py", line 406, in __scan
    result, total_chunks_scanned = self.scanner.scan(start_block, end_block)
  File "/usr/local/lib/python3.12/site-packages/nucypher/utilities/events.py", line 343, in scan
    actual_end_block, end_block_timestamp, new_entries = self.scan_chunk(current_block, estimated_end_block)
  File "/usr/local/lib/python3.12/site-packages/nucypher/utilities/events.py", line 249, in scan_chunk
    processed = self.process_event(event=evt, get_block_when=get_block_when)
  File "/usr/local/lib/python3.12/site-packages/nucypher/blockchain/eth/trackers/dkg.py", line 43, in process_event
    hook(event, get_block_when)
  File "/usr/local/lib/python3.12/site-packages/nucypher/blockchain/eth/trackers/dkg.py", line 399, in _handle_ritual_event
    d = self.__execute_action(ritual_event=ritual_event, timestamp=timestamp)
  File "/usr/local/lib/python3.12/site-packages/nucypher/blockchain/eth/trackers/dkg.py", line 384, in __execute_action
    return task()
  File "/usr/local/lib/python3.12/site-packages/nucypher/blockchain/eth/trackers/dkg.py", line 378, in task
    self.actions[event_type](timestamp=timestamp, **formatted_kwargs)
  File "/usr/local/lib/python3.12/site-packages/nucypher/blockchain/eth/actors.py", line 426, in perform_round_1
    nodes, transcripts = list(zip(*self._resolve_validators(ritual)))
  File "/usr/local/lib/python3.12/site-packages/nucypher/blockchain/eth/actors.py", line 307, in _resolve_validators
    self.block_until_specific_nodes_are_known(
  File "/usr/local/lib/python3.12/site-packages/nucypher/network/nodes.py", line 707, in block_until_specific_nodes_are_known
    raise self.NotEnoughTeachers(
nucypher.network.nodes.NotEnoughTeachers: After 60 seconds and 0 rounds, didn't find these 1 nodes: {'0x24dbb0BEE134C3773D2C1791d65d99e307Fe86CF'}

Restarting event scanner task!
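
The failure mode in the log above (an exception from one event hook escaping scan_chunk and taking down the whole scanning task) can be illustrated with a minimal, self-contained sketch. Everything here is hypothetical: the scan_chunk function, the hook, and the event labels are stand-ins rather than nucypher's actual code, and the sketch does not claim to show how the issue was actually fixed. It only shows one general way to keep a single failing hook from aborting a scan, assuming that skipping the failed event is acceptable.

```python
import logging

logger = logging.getLogger("scanner-sketch")


class NotEnoughTeachers(Exception):
    """Stand-in for the exception raised when cohort peers can't be found."""


def scan_chunk(events, hook):
    """Run the hook on each event; log and skip failures instead of propagating.

    Hypothetical helper for illustration only, not nucypher's scan_chunk.
    """
    processed, failed = [], []
    for event in events:
        try:
            hook(event)
        except Exception as exc:
            # One bad event no longer aborts the chunk, so the scanner
            # task is not forced into a crash/restart cycle.
            logger.warning("Error during event hook for %r: %s", event, exc)
            failed.append(event)
        else:
            processed.append(event)
    return processed, failed


def example_hook(event):
    """Hypothetical hook that fails for one ritual, mimicking an unreachable peer."""
    if event == "ritual-3":
        raise NotEnoughTeachers("didn't find these 1 nodes")


processed, failed = scan_chunk(["ritual-2", "ritual-3", "ritual-4"], example_hook)
print(processed, failed)
```

In this sketch the events before and after the failing one are still processed, and the failed event is reported back to the caller, which could retry it later or drop it. Whether a failed ritual event should be retried or dropped is a design decision that this sketch deliberately leaves open.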

Fixed via #3390.