bug: RLN inconsistent state rejects valid messages
Originally reported in #2870, this issue refers to the same problem but expands on its root cause and outlines a possible solution.
In short, valid messages containing valid RLN proofs are rejected by some nodes with `invalid message: provided root does not belong to acceptable window of roots`. This bug is difficult to reproduce, but it shows up more frequently when new RLN memberships are being inserted in the contract. With a static set of memberships it won't show up.
The problem is the following:
- Each node contains a Merkle tree with all RLN memberships.
- The tree is constructed from onchain information.
- The blockchain is queried every `DefaultBlockPollRate`.
- This means that when a new membership is added to the contract, nwaku nodes will add it to the tree at different times, varying from 0 to 6 seconds (`DefaultBlockPollRate`).
- During this interval, nodes may have different Merkle trees, with different roots.
- Hence, some nodes hit the `invalid message: provided root does not belong to acceptable window of roots` error (see the sketch after this list).
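
To make the race concrete, here is a minimal, self-contained Python sketch (not nwaku code; `Node`, `poll`, and the toy `root` hash are purely illustrative) of two nodes that poll the contract at different offsets within the same `DefaultBlockPollRate` window:

```python
import hashlib

def root(memberships):
    # Toy stand-in for a Merkle root: a hash of the ordered membership list.
    return hashlib.sha256(",".join(memberships).encode()).hexdigest()[:8]

class Node:
    """Illustrative node that refreshes its tree every poll_rate seconds."""
    def __init__(self, name, poll_offset, initial, poll_rate=6):
        self.name = name
        self.poll_offset = poll_offset
        self.poll_rate = poll_rate
        self.memberships = list(initial)

    def poll(self, now, chain_memberships):
        # Only refresh on this node's own polling schedule.
        if (now - self.poll_offset) % self.poll_rate == 0:
            self.memberships = list(chain_memberships)

chain = ["m1", "m2"]
nwaku2 = Node("nwaku2", poll_offset=0, initial=chain)
nwaku3 = Node("nwaku3", poll_offset=3, initial=chain)

for t in range(7):
    if t == 2:
        chain = chain + ["m3"]        # new membership registered onchain at t=2
    for node in (nwaku2, nwaku3):
        node.poll(t, chain)
    print(t, root(nwaku2.memberships), root(nwaku3.memberships))

# Between t=3 (nwaku3 refreshes) and t=6 (nwaku2 refreshes) the two roots
# differ, so a proof built against one root is rejected by the other node.
```

Running it prints identical roots up to t=2, different roots from t=3 to t=5, and identical roots again at t=6, which is exactly the "ambiguity window" described above.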
As shown in the image, we have some sort of race condition during this interval. If `nwaku1` creates a message at `t1` and propagates it to the network, `nwaku2` will flag it as invalid and `nwaku3` as valid. There is an inconsistency between the nodes, since their Merkle trees are not guaranteed to be the same.
Knowing the issue, there are two ways to mitigate it (but not completely fix it):
- Reduce `DefaultBlockPollRate`. A lower value would reduce how likely this is to happen, since it shortens the "ambiguity window", but the issue would still be there.
- Use an event-based approach (e.g. websockets): subscribe to the tip of the blockchain and, for every new block head event, update the tree (see the sketch after this list). This also reduces the "ambiguity window", but in theory the problem can still happen. And syncing state from websocket-like events has shown clear problems in the past, since afaik there are no guarantees that you won't miss an event.
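
For reference, a minimal sketch of what the event-driven variant could look like, assuming a hypothetical `subscribe_new_heads` generator and `fetch_membership_events` helper (neither is an existing nwaku or web3 API; they just stand in for an `eth_subscribe("newHeads")`-style subscription and a contract log query):

```python
import asyncio
import itertools

async def subscribe_new_heads():
    """Hypothetical stand-in for a websocket newHeads subscription:
    yields fake block headers every 2 seconds."""
    for number in itertools.count(1):
        await asyncio.sleep(2)
        yield {"number": number}

def fetch_membership_events(block_number):
    """Hypothetical placeholder for reading the membership contract's events
    emitted in the given block."""
    return []

async def track_memberships(tree_leaves):
    # Instead of polling every DefaultBlockPollRate, update the tree as soon
    # as a new head arrives. The ambiguity window shrinks to network and
    # processing latency, but a missed event still desynchronises the tree,
    # which is the drawback noted above.
    async for head in subscribe_new_heads():
        tree_leaves.extend(fetch_membership_events(head["number"]))

# asyncio.run(track_memberships([]))  # runs forever, updating the toy tree
```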
So my proposal to fix this would be the following. We need to somehow connect RLN timestamps (aka epochs) to blockchain timestamps (aka blocks or slots). This would allow a proper ordering of events, so we can know whether the RLN message we are receiving is from the future. In other words, if we receive an RLN message with timestamp 10 but we have only synced the blockchain up to timestamp 9, then we know we can't process this message yet, since we are not up to date. In this case, the node would need to sync first and validate after. With this, we would no longer reject a valid message, since we sync first and process after. The solution has, of course, to be well designed to avoid unnecessary delays.
It's up for discussion how "connecting RLN timestamps to blockchain timestamps" would work. If using Ethereum L1, slots could be an option, since they happen at fixed 12-second intervals. It's unclear how this would work on L2s, and as a requirement we need a solution that is portable across chains/layers.
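
As a rough illustration of the L1 case, assuming Ethereum mainnet's public beacon chain parameters (genesis at Unix time 1606824023, 12-second slots); this is just a sketch of the timestamp-to-slot mapping, not an nwaku design:

```python
# Ethereum mainnet beacon chain parameters (public constants).
GENESIS_TIME = 1606824023      # Unix timestamp of slot 0
SECONDS_PER_SLOT = 12

def slot_for_timestamp(unix_ts: int) -> int:
    """Map a Unix timestamp to the beacon chain slot it falls in."""
    return (unix_ts - GENESIS_TIME) // SECONDS_PER_SLOT

def message_is_from_the_future(message_ts: int, last_processed_block_ts: int) -> bool:
    # If the message's slot is ahead of the slot of the last block we processed,
    # we know we are not up to date and should sync before validating.
    return slot_for_timestamp(message_ts) > slot_for_timestamp(last_processed_block_ts)
```

On L2s without fixed slot times, the same mapping would need a chain-specific definition of `SECONDS_PER_SLOT`, which is the portability concern mentioned above.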
> Use an event-based approach (e.g. websockets): subscribe to the tip of the blockchain and, for every new block head event, update the tree. This also reduces the "ambiguity window", but in theory the problem can still happen. And syncing state from websocket-like events has shown clear problems in the past, since afaik there are no guarantees that you won't miss an event.
this approach is probably less robust than our current implementation
As discussed in our call, the option of linking RLN epochs to the underlying blockchain's slots is a promising idea, with some noteworthy tradeoffs:
- The slot time will become our epoch size, i.e. with Ethereum mainnet it's 12 seconds, and therefore the RLN epoch size should be aligned to that.
- It's unclear if this solution will work with chains that do not have slot-based block production / fixed-time intervals for block production.
- This will need a reconciliation mechanism for missed block proposals.
Here's what the pseudocode would look like:

```
[cached]   lastProcessedBlockNumber
[cached]   lastProcessedBlockTimestamp
[constant] blockSlotSize

func onNewBlock(block) ->
    processBlock(block)
    lastProcessedBlockNumber = block.number
    lastProcessedBlockTimestamp = block.timestamp

func validateMessage(wakuMessage) ->
    validateEpoch(wakuMessage)
    if wakuMessage.timestamp > lastProcessedBlockTimestamp + blockSlotSize:
        # This means we may be lagging behind the block that the sender has
        # processed, so we force-process the latest block and then verify the
        # proof. This leads to a probable DoS vulnerability where the sender
        # intentionally spams a node with this condition, causing unnecessary
        # calls to the RPC provider / Ethereum node.
        forceProcessLatestBlock()
    verifyProof(wakuMessage.proof)
```
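
A self-contained Python rendering of the same flow may help; the class and method names below are illustrative only (nwaku itself is written in Nim and structured differently):

```python
class RlnValidator:
    """Toy model of the proposed "sync first, validate after" flow."""

    BLOCK_SLOT_SIZE = 12  # seconds; assumes an Ethereum L1-style slot time

    def __init__(self, chain_client, tree):
        self.chain = chain_client              # hypothetical RPC/chain client
        self.tree = tree                       # hypothetical membership Merkle tree
        self.last_processed_block_ts = 0

    def on_new_block(self, block):
        # Insert any memberships registered in this block, then record how far
        # we have synced.
        for commitment in block.get("new_memberships", []):
            self.tree.insert(commitment)
        self.last_processed_block_ts = block["timestamp"]

    def validate_message(self, message):
        # (Epoch validation omitted for brevity.)
        # If the message claims a time beyond our synced tip plus one slot, we
        # may be lagging behind the sender: force a sync, then verify the proof.
        if message["timestamp"] > self.last_processed_block_ts + self.BLOCK_SLOT_SIZE:
            self.on_new_block(self.chain.get_latest_block())  # hypothetical call
        return self.tree.verify_proof(message["proof"])       # hypothetical call
```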
We will have to do some preliminary analysis on the variance of block times of the L2s we may be interested in for mainnet deployment, to estimate actual values of `blockSlotSize` and, subsequently, the `epochSize`.
cc: @alrevuelta
@rymnc thanks! In general I like the idea, but perhaps there is an attack vector here. `wakuMessage.timestamp` is set by the sender, so an attacker could artificially set a timestamp in the future. This would force `forceProcessLatestBlock` to be executed. Perhaps it's not a big deal, because if the node is already in sync, `forceProcessLatestBlock` would return without any onchain call? But perhaps we need to add some checks here to prevent that, for example rejecting timestamps greater than the `lastHeadBlockTimestamp`. That has other problems, though: if the clock of the sender is off by a bit, we may reject valid messages.
On the other hand, I'm concerned that this still doesn't deterministically ensure messages are processed correctly, since it relies on the local timestamp of each node. The shorter the block time, the more impact a biased clock will have.
But in any case, I think it's a great way to start.
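
To make those two concerns concrete, here is a small sketch of how the forced sync could be guarded; `MAX_CLOCK_DRIFT`, `MIN_SYNC_INTERVAL`, and the function itself are assumptions for illustration, not part of the proposal above:

```python
import time

MAX_CLOCK_DRIFT = 2      # seconds of sender clock skew tolerated (assumed value)
MIN_SYNC_INTERVAL = 6    # minimum seconds between forced syncs (assumed value)

_last_forced_sync = 0.0

def should_force_sync(message_ts: float, last_block_ts: float, block_slot_size: float) -> bool:
    """Decide whether a "future" message timestamp justifies forcing a block sync.

    Combines the two mitigations discussed above: a small drift allowance so
    that slightly skewed sender clocks don't trigger syncs, and a throttle so
    that a spammer sending far-future timestamps can't make the node hammer
    the RPC provider."""
    global _last_forced_sync
    if message_ts <= last_block_ts + block_slot_size + MAX_CLOCK_DRIFT:
        return False                                  # within window: no sync needed
    now = time.monotonic()
    if now - _last_forced_sync < MIN_SYNC_INTERVAL:
        return False                                  # throttled: we synced very recently
    _last_forced_sync = now
    return True
```

The drift allowance makes spurious syncs from slightly skewed clocks less likely, and the throttle bounds how often a spammer can trigger RPC calls, at the cost of occasionally validating against a slightly stale tree.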