Avoid waste of blockspace with invalid blobs
Right now the `submit_blob` extrinsic returns an error if the preconditions for blob submission are not met.
Specifically, those preconditions are:
- the blob itself is less than or equal to the maximum size (actually enforced by encoding it as a `BoundedVec`),
- the ordinal number of the blob is less than or equal to the maximum number of blobs,
- the total blob size (including the one being submitted) must not exceed the maximum size.
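Roughly, the checks amount to something like this (illustrative sketch only, not the actual pallet code; the storage items, config names, and error variants are stand-ins):

```rust
// Hypothetical sketch of the current submit_blob precondition checks.
// `BlobCount`, `TotalBlobSize`, `MaxBlobs`, `MaxTotalBlobSize` and the errors
// are illustrative names, not necessarily the real ones.
pub fn submit_blob(
    origin: OriginFor<T>,
    namespace_id: u32,
    // The per-blob size limit is enforced by the encoding itself.
    blob: BoundedVec<u8, T::MaximumBlobSize>,
) -> DispatchResult {
    let _who = ensure_signed(origin)?;

    ensure!(
        BlobCount::<T>::get() < T::MaxBlobs::get(),
        Error::<T>::MaxBlobsReached
    );
    ensure!(
        TotalBlobSize::<T>::get().saturating_add(blob.len() as u32)
            <= T::MaxTotalBlobSize::get(),
        Error::<T>::MaxTotalBlobSizeReached
    );

    // ... record the blob and commit it into the NMT ...
    Ok(())
}
```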
Those preconditions may make sense, but they are not doing what they are supposed to do. Specifically, it looks like they are trying to allocate the blockspace for blobs, but the way it's done is very weak. Let's take a closer look.
Returning an error from the extrinsic has the following effects:
- The sender is still charged fees and the collator still gets paid.
- The extrinsic data is left in the block.
- The blob is not committed in the NMT.
First of all, even if the preconditions are not met, it doesn't mean that there is absolutely no space for another blob. It may still be possible (although, IMO, unlikely) that another blob can be put into the block by the collator. The blob will end up in the block but won't affect the NMT trie. In this case, the collator gets its fee for the inclusion anyway, and as such is actually incentivised to shove as many blobs as possible into a block, potentially robbing poor sequencers of their fees.
At the same time, those limits are present to make sure that there is space left in the block for other extrinsics and some space left for witness data in the PoV. However, I would argue that they don't actually achieve any of that.
If the limits are meant to make sure there is enough space for witness data in the PoV even when the blob slots are fully packed, they won't actually achieve that, because other types of extrinsics or logic may already fill the PoV with witness data. Furthermore, there is not much sense in limiting PoV consumption this way: anything goes as long as the block builder is able to fit into the block size limit and the PoV size limit.
If the limits are meant to make sure the block space is not consumed, they are not fit for that either, because a transaction that failed the blob inclusion is still present in the block and occupies space. As mentioned above, the collator has no incentive to avoid overfilling the blockspace. The same dynamic applies here too: in the end it doesn't matter how exactly the collator decides to allocate the block, as long as the block is valid (and after all, the collator may forfeit its block creation privilege anyway).
The actual effect, or the "punishment", of including the blob when it's not supposed to be included, is not committing the blob into the NMT. This is silly, because:
- It's just the runtime being anal towards, I guess, the block builder. It refuses to do work it is perfectly able to do, over something akin to making a minor mistake when filling in a form. Everything that's needed is there, just do the fucking job.
- It's misattributed. The punishment is directed at the blob submitter for the result of the block producer's actions.
- Even setting the misattribution aside, it fails as a punishment. Not every rollup is interested in creating inclusion proofs, so they may as well use the failed extrinsics.
So the current construction clearly doesn't work. On top of that, it also gives rise to complications. For example, see #71.
All of that suggests that we should probably treat the blobs present in the block as unconditionally successful. It should be up to the collator to decide how much PoV and block space it wants to allocate to each part of the block, and the collator should be incentivized to do the "right thing", which is to include more blobs.
However, there are some constraints that should probably still be enforced. One is the number of blobs. Firstly, they spend some compute. Sure, that could be solved by weight. However, in the future we may get to sugondat multicore, and an unbounded number of blobs in a worker may overwhelm the aggregator.
The maximum size of a blob may also be worth enforcing.
There is another constraint we discussed before: the blobs should ideally be ordered by their namespace.
So we have those block validity constraints; how do we want to satisfy them?
Panicking is one option. It actually would help keep precondition-violating blobs out of a block. However, panics typically pose a DoS vector: an extrinsic that panics is not included in the block, so the sender is not charged fees and can create lots of those blobs. I am also not sure about the behavior of the mempool in this case. When a transaction is validated and its dispatchable fails, how should that be treated? I guess the blob transaction should be banned, because blob transactions are always included based off a fresh block state (and thus only the blob's own properties can lead to panics, and those won't change with revalidation). Then, how would that work during block building? When the `submit_blob` extrinsic panics, should that tx be retried or banned? If the transaction panicked because the block was full, the blob could fit into the next block, and if it was banned, that blob could be forgotten by the network. However, if the blob is too big, retrying opens an attack where the adversary gets free tries. In any case, panicking won't allow us to enforce the ordering constraint.
So it's clear to me that panicking is not the right tool for that.
Perhaps something like `ValidateUnsigned` would work. Well, in our case there is a signer, I just don't know if there is a corresponding API for that. Such an API would definitely be preferable, because the constraints should be the same as for a normal signed transaction, plus some extra constraints.
Next, the nuclear option: custom block authorship. That would provide us with the greatest flexibility. For example, maybe we want to split `submit_blob` from the blob data: `submit_blob` would only carry the blob hash, and a full block would have a payload space for the blob preimages. A valid block always has all the blob preimages. The PoV probably still has to contain all blob preimages. On the p2p layer between sugondat nodes, the blobs could maybe sometimes be omitted. Those kinds of tricks.
That said, this is probably more for the future.
> an extrinsic that panics is not included in the block, so the sender is not charged fees and can create lots of those blobs. I am also not sure about the behavior of the mempool in this case.
Panics are the right solution, in my opinion, because we want the blob not to appear in the block at all if it is too large, and therefore the whole block should be invalid if it does: parachain nodes and the PVF should reject it. The only way to achieve this is to panic.
There are runtime hooks for the `TaggedTransactionQueue` which can be used to expunge blobs which exceed `MaximumBlobSize`. These will never be accepted into the mempool of honest nodes or attempted to be pushed into blocks, with one edge case that I'll discuss later: when `MaximumBlobSize` decreases.
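A minimal sketch of what that could look like in the runtime's `impl_runtime_apis!` block, assuming the standard generic `UncheckedExtrinsic` (so the call is available as `tx.function`), a pallet instance named `Blobs` in `construct_runtime!`, and an illustrative `MaximumBlobSize` getter on the pallet config:

```rust
impl sp_transaction_pool::runtime_api::TaggedTransactionQueue<Block> for Runtime {
    fn validate_transaction(
        source: TransactionSource,
        tx: <Block as BlockT>::Extrinsic,
        block_hash: <Block as BlockT>::Hash,
    ) -> TransactionValidity {
        // Reject blobs exceeding the per-blob maximum before they are ever
        // accepted into the pool or pushed into a block. An `Invalid` result
        // (other than ExhaustsResources) causes honest nodes to expunge the
        // transaction.
        if let RuntimeCall::Blobs(pallet_blobs::Call::submit_blob { blob, .. }) = &tx.function {
            if blob.len() as u32 > <Runtime as pallet_blobs::Config>::MaximumBlobSize::get() {
                // Arbitrary custom error code standing for "blob too large".
                return Err(InvalidTransaction::Custom(0).into());
            }
        }
        // Fall through to the usual checks (signature, nonce, fees, ...).
        Executive::validate_transaction(source, tx, block_hash)
    }
}
```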
To handle the case where a blob is valid but cannot fit within `MaxTotalBlobSize`, or where the limit on the number of blobs has been reached, we can return `InvalidTransaction::ExhaustsResources` from `BlockBuilderApi::apply_extrinsic`. I've looked at the default Substrate block authorship code, and the resulting behavior is to skip the transaction temporarily, but not expunge it from the pool. Other errors would cause the blob to be expunged from the pool, which we don't want.
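A corresponding sketch for the block builder API, under the same assumptions as above; `would_exceed_block_limits` is a hypothetical helper on `pallet_blobs` (see the note about convenience functions further down), not an existing function:

```rust
impl sp_block_builder::BlockBuilder<Block> for Runtime {
    fn apply_extrinsic(extrinsic: <Block as BlockT>::Extrinsic) -> ApplyExtrinsicResult {
        // If this blob submission would overrun MaxTotalBlobSize or the
        // blob-count limit, report ExhaustsResources: the default authorship
        // logic then skips the transaction for this block without expunging
        // it from the pool.
        if let RuntimeCall::Blobs(pallet_blobs::Call::submit_blob { blob, .. }) =
            &extrinsic.function
        {
            if Blobs::would_exceed_block_limits(blob.len() as u32) {
                return Err(InvalidTransaction::ExhaustsResources.into());
            }
        }
        Executive::apply_extrinsic(extrinsic)
    }

    // finalize_block, inherent_extrinsics and check_inherents stay as in the
    // template runtime.
}
```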
I'll return to the edge case of `MaximumBlobSize` decreasing. We don't need to worry about the situation where it increases, because the blobs valid after an increase are a superset of the blobs valid before it. However, when the maximum size decreases, some transactions which previously were valid will no longer be. Although the transaction pool periodically re-evaluates all pending transactions, there are races around the exact time of the `MaximumBlobSize` decrease where the old transactions have not yet been expunged from the pool. A node authoring a block may attempt to push a blob whose size is too large as a result. A runtime panic while submitting the transaction is acceptable in this situation: the result of a panic during authorship is that the blob is expunged from the pool, which is the intended behavior since it is no longer valid.
So, as concrete steps:
- Panic in the blobs pallet when individual blob and aggregate blob sizes are overrun.
- The `TaggedTransactionQueue` API should return `TransactionValidity::Invalid` for blob transactions that are individually too large.
- `BlockBuilderApi::apply_extrinsic` should return `ExhaustsResources` when the total blob size or number limits are reached.
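For the first bullet, the pallet-side checks become assertions rather than recoverable errors, so a block containing an overrunning blob fails to execute entirely. A rough sketch, with the same illustrative storage and config names as above:

```rust
// Inside pallet_blobs::Pallet::submit_blob (sketch): overruns now panic
// instead of returning a dispatch error.
let blob_len = blob.len() as u32;

// Individual blob size overrun: the whole block is invalid.
assert!(
    blob_len <= T::MaximumBlobSize::get(),
    "blob exceeds MaximumBlobSize"
);

// Aggregate blob size overrun: the whole block is invalid.
let new_total = TotalBlobSize::<T>::get().saturating_add(blob_len);
assert!(
    new_total <= T::MaxTotalBlobSize::get(),
    "aggregate blob size exceeds MaxTotalBlobSize"
);
```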
This solution does not behave perfectly, because the basic authorship logic in Substrate only makes up to 8 attempts when `ExhaustsResources` is returned before it gives up on pushing further transactions into the block. There may be attacks where an adversary constructs a set of transactions causing the node to pack the block very badly, and this can only be solved with more intelligent iteration over ready transactions.
> Panics are the right solution, in my opinion, because we want the blob not to appear in the block at all if it is too large, and therefore the whole block should be invalid if it does: parachain nodes and the PVF should reject it. The only way to achieve this is to panic.
Well, it seems that we are in agreement. I did not mean that we must not use panics at all; what I said is that we should not rely exclusively on panics. Your suggestion of `TaggedTransactionQueue` is actually the API I was looking for here:
> Well, in our case there is a signer, I just don't know if there is a corresponding API for that.
The outlined solution makes perfect sense.
Okay, let's implement that solution, then.
The slightly annoying part of the solution is that it requires each runtime to incorporate the same changes in the runtime API implementations. Hopefully we can write some convenience functions in `pallet_blobs` to achieve the desired effect.
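One possible shape for those convenience functions, purely as a sketch (names, signatures, and storage items are hypothetical), would be for `pallet_blobs` to expose the two checks so that each runtime's API implementations reduce to a couple of lines:

```rust
impl<T: Config> Pallet<T> {
    /// Hypothetical helper for `TaggedTransactionQueue::validate_transaction`:
    /// true if a blob of `blob_len` bytes exceeds the per-blob maximum.
    pub fn blob_too_large(blob_len: u32) -> bool {
        blob_len > T::MaximumBlobSize::get()
    }

    /// Hypothetical helper for `BlockBuilderApi::apply_extrinsic`: true if
    /// adding a blob of `blob_len` bytes would overrun `MaxTotalBlobSize` or
    /// the blob-count limit in the block being built.
    pub fn would_exceed_block_limits(blob_len: u32) -> bool {
        BlobCount::<T>::get() >= T::MaxBlobs::get()
            || TotalBlobSize::<T>::get().saturating_add(blob_len)
                > T::MaxTotalBlobSize::get()
    }
}
```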
A related problem is: https://github.com/thrumdev/sugondat/issues/71
This is tested and can be closed now.