thrumdev/blobs

Make blob submission more robust

Right now, and even after #76, blob submission is not very robust. Specifically, I fear that users would expect it to be fire-and-forget, and it's certainly not. We should not forget that this code is going to be used by sequencers and thus should be of appropriate quality.

Can you describe in which ways it is not robust? The following comes to mind:

  1. Connections to the sugondat node can be dropped. What should happen in this case is a re-connection behind the scenes, as well as a re-submission. #76 does this, no?
  2. Blobs of size N may be dropped when the configuration changes to a size limit smaller than N. In this case, the shim should detect that the limit has changed and return an error.

What am I missing? I'd expect that (1) is already rare. In my experience, RPC connections to nodes are extremely stable - I have services running 24/7 that maintain persistent connections to nodes without issues. (2) is also quite rare, as I expect the configuration to change only very rarely.

Since nothing is certain in the world, we can't provide 100% reliability. But I would expect that even the basic implementation we currently have is 99.99% or 99.999% reliable.

First of all, no, #76 doesn't try to do resubmission.

Then, I agree that losing connections is a rare event. Still, we should handle that, at least in the future. It's not about just losing a connection (although that indeed can happen), but also things like:

  1. crashes of either the sugondat node or the shim, due to things that we control (bugs, leaks) or that we don't control (misconfigurations, OS bugs).
  2. underlying hardware maintenance.
  3. upgrading versions of those.

A non-resilient implementation could lead to a blob not being included, or to a blob being included more than once. A rollup may choose to slash for that. Besides, DA is said to become a big cost factor, so you don't want to splurge.

With that said, what does the submission code do right now? It:

  1. Retrieves the nonce for the account in question.
  2. Creates the extrinsic (aka transaction in polkadot lingo) data.
  3. Signs it with the signer.
  4. Pushes the extrinsic into the mempool.
  5. Subscribes to the result of that extrinsic.

I don't understand many parts of this process. First, it's not clear which nonce it returns. I would bet that it looks up the on-chain value. That means that submitting several blobs at the same time probably won't work.
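To illustrate the nonce concern, here is a toy sketch. It is not the actual shim or node code: `on_chain_nonce`, `nonces_for_concurrent_submissions`, and `accepted_by_mempool` are hypothetical names, and the rule that a mempool keeps at most one transaction per nonce is a simplifying assumption.

```rust
use std::collections::HashSet;

/// Hypothetical stand-in for a nonce lookup that returns the on-chain value.
/// It only counts transactions already included in blocks and ignores
/// anything still pending in the mempool.
fn on_chain_nonce(included_txs: u64) -> u64 {
    included_txs
}

/// Prepare `count` submissions back to back, before any of them is included.
/// Every submission asks for the nonce independently.
fn nonces_for_concurrent_submissions(included_txs: u64, count: usize) -> Vec<u64> {
    (0..count).map(|_| on_chain_nonce(included_txs)).collect()
}

/// Toy mempool rule (an assumption): at most one transaction per nonce
/// survives for a given account.
fn accepted_by_mempool(nonces: &[u64]) -> usize {
    nonces.iter().collect::<HashSet<_>>().len()
}
```

Under these assumptions, three blobs submitted at once for an account with 7 included transactions all get nonce 7, and only one of them can make it.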

If we employed a naive resubmission, then it would be possible for one blob to land on-chain twice.
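A toy model of how that could happen, assuming "naive" means re-reading the nonce and re-signing. `Tx` and `ToyChain` are hypothetical types, and the inclusion rule (a tx is included only if its nonce equals the account's current nonce) is a simplification of what a real chain does.

```rust
/// Hypothetical signed transaction: just a nonce and the blob payload.
#[derive(Clone, Debug, PartialEq)]
struct Tx {
    nonce: u64,
    blob: Vec<u8>,
}

/// Toy chain tracking a single submitter account.
#[derive(Default)]
struct ToyChain {
    account_nonce: u64,
    included: Vec<Tx>,
}

impl ToyChain {
    /// Include the tx only if its nonce matches the account's current nonce.
    /// Returns whether the tx was included.
    fn try_include(&mut self, tx: Tx) -> bool {
        if tx.nonce != self.account_nonce {
            return false; // stale or future nonce
        }
        self.account_nonce += 1;
        self.included.push(tx);
        true
    }

    /// How many times the given blob landed on chain.
    fn copies_of(&self, blob: &[u8]) -> usize {
        self.included
            .iter()
            .filter(|tx| tx.blob.as_slice() == blob)
            .count()
    }
}
```

If the original tx did land but the shim missed the notification, a resubmission that re-reads the nonce produces a new, valid tx carrying the same blob. Resubmitting the exact same signed tx is rejected in this model because its nonce is already used.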

Maybe we should try to be smart about it and move the nonce control into the shim. I think it would be reasonable to declare the shim as the sole controller of the submitter private key.
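A minimal sketch of what shim-side nonce control could look like, assuming the shim really is the only user of the key. `NonceAllocator` is a hypothetical name, not something that exists in the codebase, and persistence across restarts is left out here.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Hypothetical shim-local nonce source. It is seeded once from the chain
/// and afterwards hands out nonces without asking the node again.
pub struct NonceAllocator {
    next: AtomicU64,
}

impl NonceAllocator {
    /// `on_chain_nonce` is the value read from the node at startup.
    pub fn new(on_chain_nonce: u64) -> Self {
        Self {
            next: AtomicU64::new(on_chain_nonce),
        }
    }

    /// Returns a fresh nonce. Concurrent callers never get the same value
    /// because `fetch_add` is a single atomic operation.
    pub fn allocate(&self) -> u64 {
        self.next.fetch_add(1, Ordering::SeqCst)
    }
}
```

This only works if nothing else signs with the same key, which is why the shim would have to be declared the sole controller.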

After we submit the blob tx into the mempool, it's not 100% guaranteed that it will get to the block proposer. Maybe a bug caused the node to lose its peers, maybe a restart wiped the mempool. Either way, there is a chance that the extrinsic is lost.

Next, we subscribe to the result of the extrinsic. I am not sure how exactly it's implemented, but I suspect there is no magic. Something along the lines of just scanning the blocks for the sought-after tx hash. I can't call that robust, because for one, it won't work across restarts of the shim or with naive reconnection/resubmission.

Here is a strawman for how to make it more robust:

  1. There is a submitter component. It's configured with a submitter key. It is aware of the nonce of the submit account.
  2. That component can accept blobs.
  3. Once a blob is accepted, the component would create the result listener, pick a nonce, create an extrinsic, and sign it, which means we now have the tx hash. We persist the new state: new nonce, new blob.
  4. Then we submit the blob through RPC. Note, this operation is idempotent and as such is safe to do multiple times (as long as it's literally the tx with the same hash).
  5. There is a task spawned by the submitter which scans for all the tx hashes that landed on chain. The ones that landed are removed from the durable log. The corresponding in-memory listeners, if alive, are notified about this event. This process of course should survive restarts.
  6. From time to time, the blobs should be resubmitted, to ensure they don't get stuck.
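
A rough sketch of the strawman above, under loud assumptions: `Submitter`, `SignedTx`, `Rpc`, and `ToyNode` are all hypothetical names, the "durable log" is an in-memory map standing in for real persistence, signing is omitted, and `DefaultHasher` stands in for the real tx hash. Listeners are reduced to the list of hashes returned by `on_block`.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::BTreeMap;
use std::hash::{Hash, Hasher};

type TxHash = u64;

/// Hypothetical signed extrinsic. Real signing is omitted.
#[derive(Clone, Debug, Hash, PartialEq)]
struct SignedTx {
    nonce: u64,
    blob: Vec<u8>,
}

impl SignedTx {
    /// Stand-in for the real tx hash: same nonce and blob give the same hash.
    fn tx_hash(&self) -> TxHash {
        let mut hasher = DefaultHasher::new();
        self.hash(&mut hasher);
        hasher.finish()
    }
}

/// Hypothetical RPC surface of the node.
trait Rpc {
    fn submit(&mut self, tx: &SignedTx);
}

/// Toy node whose mempool is keyed by tx hash, so re-submitting is a no-op.
#[derive(Default)]
struct ToyNode {
    mempool: BTreeMap<TxHash, SignedTx>,
}

impl Rpc for ToyNode {
    fn submit(&mut self, tx: &SignedTx) {
        self.mempool.insert(tx.tx_hash(), tx.clone());
    }
}

struct Submitter {
    next_nonce: u64,
    /// Stand-in for the durable log of txs not yet seen on chain.
    pending: BTreeMap<TxHash, SignedTx>,
}

impl Submitter {
    fn new(on_chain_nonce: u64) -> Self {
        Self { next_nonce: on_chain_nonce, pending: BTreeMap::new() }
    }

    /// Step 3: pick a nonce, build the tx, record it before any submission.
    fn accept(&mut self, blob: Vec<u8>) -> TxHash {
        let tx = SignedTx { nonce: self.next_nonce, blob };
        let hash = tx.tx_hash();
        self.next_nonce += 1;
        self.pending.insert(hash, tx);
        hash
    }

    /// Steps 4 and 6: (re)submit everything still pending. Safe to repeat
    /// because the very same txs are sent every time.
    fn submit_pending(&self, rpc: &mut impl Rpc) {
        for tx in self.pending.values() {
            rpc.submit(tx);
        }
    }

    /// Step 5: drop txs found in a scanned block; returns the hashes whose
    /// listeners should be notified.
    fn on_block(&mut self, included: &[TxHash]) -> Vec<TxHash> {
        let mut landed = Vec::new();
        for hash in included {
            if self.pending.remove(hash).is_some() {
                landed.push(*hash);
            }
        }
        landed
    }
}
```

The point of recording the tx before the first submission is that a crash between the two steps leaves a pending entry that gets resubmitted, rather than a burned nonce with no tx.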

Even with this construction, there are some remaining concerns that I glossed over:

  1. The tips. So far this has not been a problem in the polkadot community, but hopefully it will be. It may be desirable to handle the
  2. Mortality. Either we submit immortal txs or we handle re-signing with new nonces.

It's worth mentioning the elephant in the room. I don't think the adapters are resilient themselves. I mean, sovereign just kills itself when it receives at least one error from the DA adapter.

I don't think they would benefit from the resiliency for the time being.