Handle messages for future instances that can't be validated
Message validation depends on information from the base tipset for an instance, such as the power table and signing keys. This means that when participants drift a little out of sync, some participant could receive a message for a future instance which it cannot yet validate. How should we handle this?
Message validation is currently exposed as a synchronous method so that the network layer can validate most messages as they arrive and refuse to propagate invalid ones. If a message can't be validated (yet), should the network layer also refuse to propagate it? This could hamper propagation of messages from fast-running nodes, which legitimately reach future instances first.
The message cannot be rejected locally, as it may be valid and therefore necessary for progress. It must be queued somewhere for revalidation when the appropriate instance starts.
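One way to let the network layer tell "invalid" apart from "not yet validatable" is a three-way result instead of a boolean, similar to the accept/reject/ignore outcomes a go-libp2p-pubsub topic validator can return. The sketch below is a minimal illustration under stated assumptions, not the F3 API: `ValidationResult`, `Message`, `Validator` and its power-table map are hypothetical names, and the non-empty-signature test stands in for a real signature check.

```go
package main

import "fmt"

// ValidationResult is a hypothetical three-way outcome for message validation.
type ValidationResult int

const (
	Accept ValidationResult = iota // valid: deliver locally and propagate
	Reject                         // provably invalid: drop, do not propagate
	Defer                          // needed information not available yet
)

// Message is a simplified, hypothetical consensus message.
type Message struct {
	Instance uint64
	Sender   string
	Sig      []byte
}

// Validator knows the power table of every instance whose base tipset it has seen.
type Validator struct {
	current     uint64                      // instance the participant is running
	powerTables map[uint64]map[string]int64 // instance -> sender -> power
}

// Validate classifies m. It never rejects a message merely because the
// information needed to check it has not arrived yet.
func (v *Validator) Validate(m Message) ValidationResult {
	table, ok := v.powerTables[m.Instance]
	if !ok {
		if m.Instance > v.current {
			return Defer // future instance: power table and keys still unknown
		}
		return Reject // past instance outside the retained window (sketch policy)
	}
	if table[m.Sender] <= 0 || len(m.Sig) == 0 {
		// Sender holds no power, or the message is unsigned
		// (placeholder for a real signature check).
		return Reject
	}
	return Accept
}

func main() {
	v := &Validator{
		current:     10,
		powerTables: map[uint64]map[string]int64{10: {"alice": 5}},
	}
	fmt.Println(v.Validate(Message{Instance: 10, Sender: "alice", Sig: []byte{1}}) == Accept)
	fmt.Println(v.Validate(Message{Instance: 11, Sender: "alice", Sig: []byte{1}}) == Defer)
}
```

With this split, the caller can decide separately what to do with deferred messages (buffer, propagate, or both) without ever treating them as invalid.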
At the moment, the API punts that local queuing out to the integration, expecting it to re-deliver the messages later. This might not be the best approach (see discussion), and it also adds complexity to the simulation.
What if the participant locally queues messages for future instances and locally revalidates them when those instances start? The network layer can't get information about those messages' validity to inform propagation, but is it too late for that anyway? It also opens up some denial-of-service risk. This is something we have to handle somewhere, and there is an existing issue about it (#12).
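A minimal sketch of the local-queue alternative, assuming the participant is told when each instance starts and that the instance's power table is available for revalidation at that point. `FutureQueue`, `maxAhead` and `maxTotal` are hypothetical; the two bounds are one simple way to cap the memory a peer can consume by sending far-future messages, not a complete answer to the denial-of-service concern tracked in #12.

```go
package main

import (
	"errors"
	"fmt"
)

// Message is a simplified, hypothetical consensus message.
type Message struct {
	Instance uint64
	Sender   string
}

var (
	ErrNotFuture   = errors.New("message is not for a future instance")
	ErrTooFarAhead = errors.New("message instance is too far ahead")
	ErrQueueFull   = errors.New("future-message queue is full")
)

// FutureQueue buffers messages for instances this participant has not started.
type FutureQueue struct {
	current  uint64               // instance currently running
	maxAhead uint64               // furthest instance accepted: current+maxAhead
	maxTotal int                  // cap on buffered messages across all instances
	total    int                  // messages currently buffered
	pending  map[uint64][]Message // instance -> buffered messages
}

func NewFutureQueue(current, maxAhead uint64, maxTotal int) *FutureQueue {
	return &FutureQueue{
		current:  current,
		maxAhead: maxAhead,
		maxTotal: maxTotal,
		pending:  make(map[uint64][]Message),
	}
}

// Enqueue buffers m without validating it; the bounds limit how much memory
// unvalidated messages can occupy.
func (q *FutureQueue) Enqueue(m Message) error {
	switch {
	case m.Instance <= q.current:
		return ErrNotFuture
	case m.Instance > q.current+q.maxAhead:
		return ErrTooFarAhead
	case q.total >= q.maxTotal:
		return ErrQueueFull
	}
	q.pending[m.Instance] = append(q.pending[m.Instance], m)
	q.total++
	return nil
}

// StartInstance advances to instance, revalidates the messages buffered for it
// (validate can now consult that instance's power table) and returns those
// that pass. Buffers for skipped earlier instances are dropped.
func (q *FutureQueue) StartInstance(instance uint64, validate func(Message) bool) []Message {
	q.current = instance
	var ready []Message
	for inst, msgs := range q.pending {
		if inst > instance {
			continue // still in the future
		}
		if inst == instance {
			for _, m := range msgs {
				if validate(m) {
					ready = append(ready, m)
				}
			}
		}
		q.total -= len(msgs)
		delete(q.pending, inst) // deleting during range is safe in Go
	}
	return ready
}

func main() {
	q := NewFutureQueue(5, 2, 100)
	_ = q.Enqueue(Message{Instance: 6, Sender: "alice"})
	if err := q.Enqueue(Message{Instance: 9, Sender: "bob"}); err != nil {
		fmt.Println("rejected:", err)
	}
	ready := q.StartInstance(6, func(m Message) bool { return true })
	fmt.Println(len(ready), "message(s) ready for instance 6")
}
```

Messages for instances that are skipped entirely are discarded when a later instance starts, so buffered state cannot accumulate for instances that will never run.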
What if the participant locally queues messages for future instances and locally revalidates them when those instances start?
I would do this to start with (that is what I proposed in my now-deprecated PR on multi-instance consensus). DoS attacks are always a possibility even if we do not propagate, but yes, we can mitigate them if we modify the network-layer propagation to only propagate upon validation. Either way, locally buffering at the f3 level seems sensible as a first step, to be extended later with network propagation control.
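Local buffering can be combined with propagating only upon validation: a message is forwarded to peers only once it validates, so a deferred message is forwarded late, when its instance starts, rather than never. This is a sketch under assumptions: it presumes a three-way validation result, `Node`, `Receive`, `OnInstanceStart` and the `deliver`/`propagate` callbacks are hypothetical, and the buffer is left unbounded for brevity.

```go
package main

import "fmt"

// Result is a hypothetical three-way validation outcome.
type Result int

const (
	Accept Result = iota
	Reject
	Defer
)

// Message is a simplified, hypothetical consensus message.
type Message struct {
	Instance uint64
	ID       string
}

// Node forwards a message to peers only after it has validated.
type Node struct {
	validate  func(Message) Result
	deliver   func(Message) // hand to the local consensus participant
	propagate func(Message) // forward to peers
	deferred  map[uint64][]Message
}

func (n *Node) Receive(m Message) {
	switch n.validate(m) {
	case Accept:
		n.deliver(m)
		n.propagate(m)
	case Defer:
		// Held locally and not forwarded: peers never receive
		// unvalidated messages from this node.
		n.deferred[m.Instance] = append(n.deferred[m.Instance], m)
	case Reject:
		// Dropped: neither delivered nor propagated.
	}
}

// OnInstanceStart re-runs Receive on messages buffered for instance; those
// that now validate are delivered and propagated late.
func (n *Node) OnInstanceStart(instance uint64) {
	msgs := n.deferred[instance]
	delete(n.deferred, instance)
	for _, m := range msgs {
		n.Receive(m)
	}
}

func main() {
	current := uint64(1)
	n := &Node{
		validate: func(m Message) Result {
			if m.Instance > current {
				return Defer
			}
			return Accept
		},
		deliver:   func(m Message) { fmt.Println("deliver", m.ID) },
		propagate: func(m Message) { fmt.Println("forward", m.ID) },
		deferred:  make(map[uint64][]Message),
	}
	n.Receive(Message{Instance: 2, ID: "early"}) // buffered, not forwarded
	current = 2
	n.OnInstanceStart(2) // now delivered and forwarded
}
```

The trade-off raised above remains visible here: fast nodes' messages travel no further than the first peer that is still behind, until that peer catches up.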
Outcome of the sync discussion: let's merge #124 and create a tracking issue for its later replacement. @ranchalp will look at how Aptos/Sui/Cosmos handle this and draft a more formal proposal, but we're leaning towards having a queue on the sender instead, plus a request-response/broadcast protocol to suppress the DoS vector.