base-org/pessimism

L1/L2 geth block polling algorithm could fall out of sync with head

Closed · 1 comment

Edge Case

A node failure causes an execution node on base ethereum to temporarily fall out of sync with the majority network for blocks h0...hn. If Pessimism is actively subscribed to this node, it will keep polling for a new block at height h0 until one is retrieved. Once retrieved, the oracle implementation waits for a fixed interval before retrieving the block at h1. Because these waits are typically proportional to the block production rate, the oracle advances at roughly the same pace as the chain and never closes the gap, so it could be operating on a false head with no indication that the application is out of sync.

Mitigation(s)

  • Check the most recent block number on every poll step and perform an immediate backfill if the oracle is more than one block behind head. This requires more node calls and could overload downstream subsystems in a real deployment if the backfill spans many blocks.
  • Reduce the default polling interval so that the application catches up over time. This requires more CPU work.

Thinking about this, reducing the poll time is something we've already implemented as configurable. The per-poll height check could potentially be configurable as well, so that it can be enabled at system start-up, for example, with a warning that more API hits are to be expected.