polkadot-fellows/RFCs

Adding a `CoreIndex` commitment to candidate receipts

Closed this issue · 23 comments

Starting this pre-RFC conversation to gather feedback on some changes to the CandidateReceipt and CommittedCandidateReceipt primitives.

The necessary changes are:

  • add a commitment to a CoreIndex in CandidateCommitments
  • add a core_index: CoreIndex field in CandidateDescriptor

These are needed to remove the limitation of only using a trusted collator set for elastic scaling parachains. Without a CoreIndex commitment in the candidate receipt, it is possible for malicious collators to spam the relay chain by sending the same collation to all backing groups assigned to a parachain at a given relay chain block (RCB). In such a scenario elasticity is effectively disabled, as all backing groups would back the same candidate in parallel instead of multiple chained candidates.

The candidate receipt primitive is used across networking protocols, the Parachains Runtime, node implementations, collators, and even tooling. Any breaking change here is very hard to deploy in practice without either upgrading everything at the same time or breaking someone. So, the following is an approach that avoids breaking things, but which might be considered hacky.

Please keep in mind that this is very important in the short/medium term in the context of the Elastic Scaling feature. As such, a proposal for a more flexible, backwards-compatible and future-proof format (allowing for more dramatic changes) is out of scope here, but it is otherwise something I am already working on.

Proposed approach

Changes in CandidateDescriptor:

  • reclaim 32 bytes from collator: CollatorId and 64 bytes from signature: CollatorSignature fields as reserved fields
  • use 4 bytes out of this reclaimed space for a new core_index: u32 field (see the sketch below).
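For illustration, a minimal sketch of the repurposed descriptor, assuming the current field order; the names and the exact split of the reclaimed bytes are assumptions, not a settled layout. The key property is that the SCALE-encoded size stays identical to today's descriptor:

```rust
// Illustrative sketch only: names and byte split are assumptions. The
// encoded size matches the current descriptor, so hashing and wire formats
// stay binary compatible with non-upgraded nodes.
struct CandidateDescriptorV2 {
    para_id: u32,
    relay_parent: [u8; 32],
    // Was `collator: CollatorId` (32 bytes): 4 bytes now carry the core
    // index, the remaining 28 bytes are reserved for future use.
    core_index: u32,
    reserved1: [u8; 28],
    persisted_validation_data_hash: [u8; 32],
    pov_hash: [u8; 32],
    erasure_root: [u8; 32],
    // Was `signature: CollatorSignature` (64 bytes), now fully reserved.
    reserved2: [u8; 64],
    para_head: [u8; 32],
    validation_code_hash: [u8; 32],
}
```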

CandidateCommitments doesn't really need to be changed, but one idea is to define a special XCM instruction like CommitCoreIndex(u32) that is appended as the last message in the upward_messages vec. The message would never be executed; it would just serve as a commitment to a specific CoreIndex.

I have also thought about appending an additional core_index: u32 field to the end of the structure, but that doesn't really work because we compute the CandidateHash as hash(CandidateDescriptor, hash(CandidateCommitments)). Older, non-upgraded validator nodes would compute a different commitment hash if they don't decode and hash the whole structure, and this would break consensus.
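For reference, this is roughly the hash structure in question (a schematic sketch assuming SCALE encoding and blake2_256, as in the Polkadot primitives; not the exact definitions):

```rust
use parity_scale_codec::Encode;
use sp_core::hashing::blake2_256;

// Schematic: the candidate hash covers the descriptor plus the hash of the
// fully encoded commitments, so a trailing field appended to the commitments
// changes the hash unless every node decodes and hashes it identically.
fn candidate_hash<D: Encode, C: Encode>(descriptor: &D, commitments: &C) -> [u8; 32] {
    let commitments_hash = blake2_256(&commitments.encode());
    blake2_256(&(descriptor, commitments_hash).encode())
}
```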

Any better and less hacky ideas, especially regarding the XCM stuff, would help me a lot.

  • reclaim 32 bytes from collator: CollatorId and 64 bytes from signature: CollatorSignature fields as reserved fields

Why? Don't we need this for on-demand, to verify that the claimant is the one that signed the collation?

Older, non-upgraded validator nodes would compute a different commitment hash if they don't decode and hash the whole structure, and this would break consensus.

You want to hack this into the protocol? I don't get why you wouldn't introduce a new version of the networking protocol to be able to append the core_index.

  • reclaim 32 bytes from collator: CollatorId and 64 bytes from signature: CollatorSignature fields as reserved fields

Why? Don't we need this for on-demand, to verify that the claimant is the one that signed the collation?

No, that was the initial idea AFAIK, but we now want to rely only on the parachain runtime to decide who builds the parachain block. Which collator provides the collation is irrelevant; it just has to be valid.

There are plans for some countermeasures against collators spamming with invalid collations (the Collator Protocol Revamp), which hopefully will be implemented this year.

Older, non-upgraded validator nodes would compute a different commitment hash if they don't decode and hash the whole structure, and this would break consensus.

You want to hack this into the protocol? I don't get why you wouldn't introduce a new version of the networking protocol to be able to append the core_index.

A new version doesn't solve the problem. The problem is that we always need to check all of the commitments when we validate a candidate. We also rely on candidate hashes to identify unique candidates. Having two distinct candidate commitments for the same parachain block breaks consensus.

Simply put, we need all nodes to encode/decode/reason about the same thing, regardless of whether there is a minority that doesn't understand the CoreIndex commitment.

use 4 bytes out of this reclaimed space for a new core_index: u32 field.

2 bytes should suffice.

A new version doesn't solve the problem. The problem is that we always need to check all of the commitments when we validate a candidate. We also rely on candidate hashes to identify unique candidates. Having two distinct candidate commitments for the same parachain block breaks consensus.

You already answered yourself that this will not work, because the old nodes will try to validate the signature. Thus, you cannot simply "reclaim" space in the candidate. Why not use the node features to make a clean cut, to switch from the old to the new format?

CandidateCommitments doesn't really need to be changed, but one idea is to define a special XCM instruction like CommitCoreIndex(u32) that is appended as the last message in the upward_messages vec. The message would never be executed; it would just serve as a commitment to a specific CoreIndex.

I have also thought about appending an additional core_index: u32 field to the end of the structure, but that doesn't really work because we compute the CandidateHash as hash(CandidateDescriptor, hash(CandidateCommitments)). Older, non-upgraded validator nodes would compute a different commitment hash if they don't decode and hash the whole structure, and this would break consensus.

Any better and less hacky ideas, especially regarding the XCM stuff, would help me a lot.

This CommitCoreIndex(u32) isn't a good fit for XCM because it would never be executed in the XCVM; it is also not a cross-chain messaging primitive. We'd really be abusing the language.

But IIUC we need a way for the parachain runtime to specify to the relay chain validators/backing group what core index the current commitment is for. So, following on from your idea: we don't really need to add it to XCM; what you really want to use is the XCM transport protocol, not the language and executor. So you could still use UMP/MQ to transport this extra data, but do it outside of the XCM transport flow.

Disclaimer: I'm not familiar with the details of parachain consensus and elastic scaling, so there might be better solutions than "abusing" the UMP queue (implementation abuse), but I wanted to offer a slight alteration to your idea so we don't abuse the XCM language (spec abuse).

A new version doesn't solve the problem. The problem is that we always need to check all of the commitments when we validate a candidate. We also rely on candidate hashes to identify unique candidates. Having two distinct candidate commitments for the same parachain block breaks consensus.

You already answered yourself that this will not work, because the old nodes will try to validate the signature. Thus, you cannot simply "reclaim" space in the candidate. Why not use the node features to make a clean cut, to switch from the old to the new format?

That is a minor setback; forcing validators to upgrade is tractable. Forcing all collators to switch to a new format is a hard breaking change that would put pressure on parachain teams and delay the enablement of open collator sets for elastic scaling. My plan is to only require an upgrade for collators if they want to use elastic scaling; otherwise they can keep signing their collations and still not commit to a core index, just like they do now, at least for a good period of time. Possibly until Omni-node becomes ubiquitous.

That is a minor setback; forcing validators to upgrade is tractable. Forcing all collators to switch to a new format is a hard breaking change that would put pressure on parachain teams and delay the enablement of open collator sets for elastic scaling.

I don't see how you'd force collators to upgrade? I mean, basically you just enable support for the new format on the validators. The collators can send whatever they support, but the validators will reject the new format until it is enabled.

This CommitCoreIndex(u32) isn't a good fit for XCM because it would never be executed in the XCVM; it is also not a cross-chain messaging primitive. We'd really be abusing the language.

But IIUC we need a way for the parachain runtime to specify to the relay chain validators/backing group what core index the current commitment is for. So, following on from your idea: we don't really need to add it to XCM; what you really want to use is the XCM transport protocol, not the language and executor. So you could still use UMP/MQ to transport this extra data, but do it outside of the XCM transport flow.

Thanks @acatangiu for the comment and discussion last week; indeed, it makes no sense to add a new XCM instruction.

Disclaimer: I'm not familiar with the details of parachain consensus and elastic scaling, so there might be better solutions than "abusing" the UMP queue (implementation abuse), but I wanted to offer a slight alteration to your idea so we don't abuse the XCM language (spec abuse).

How I would see this working is altering the UMP transport layer definition so it can be used to carry both XCM messages and commitments, which seems to be a small surface area for changes in the Parachains Runtime. One way I am thinking this can be implemented is to have a terminator for the XCM messages (something like an empty Vec) followed by the core index commitment: in receive_upward_messages we would stop forwarding messages to the MessageQueue pallet once we hit the terminator.
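A minimal sketch of that terminator handling, assuming upward messages arrive as raw Vec<Vec<u8>> blobs; the function name and shapes are illustrative, not the actual runtime code:

```rust
// Split an upward-message batch at the first empty message: everything
// before it is XCM destined for the MessageQueue pallet, everything after
// it is commitment data (e.g. the core index) for candidate validation.
fn split_upward_messages(messages: &[Vec<u8>]) -> (&[Vec<u8>], &[Vec<u8>]) {
    match messages.iter().position(|m| m.is_empty()) {
        Some(sep) => (&messages[..sep], &messages[sep + 1..]),
        // No terminator: the whole batch is ordinary XCM, exactly as today,
        // so non-upgraded parachains keep working unchanged.
        None => (messages, &[]),
    }
}
```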

That is a minor setback; forcing validators to upgrade is tractable. Forcing all collators to switch to a new format is a hard breaking change that would put pressure on parachain teams and delay the enablement of open collator sets for elastic scaling.

I don't see how you'd force collators to upgrade? I mean, basically you just enable support for the new format on the validators. The collators can send whatever they support, but the validators will reject the new format until it is enabled.

Yeah, sure, we can also do that; this sounds like an optimisation of what I just said, assuming we are talking about no longer checking collator signatures and using the bytes for more useful things.

No. I'm speaking about creating a new version of the CandidateCommitments struct. A clean cut, not some hack. Introduce a new networking version that only supports this new commitment. After it is enabled on the relay chain, collators can start using it, but they can also continue using the old version. The validators will accept both versions.

No. I'm speaking about creating a new version of the CandidateCommitments struct. A clean cut, not some hack.

I agree that a clean cut is generally better than hacking, but we have to take other concerns into consideration before we decide yay/nay on this topic.

Development cost

The clean way involves the following big changes on top of what is needed for my proposal:

  • PVF versioning (paritytech/polkadot-sdk#645), which needs some big changes in Cumulus as well as at least the node-side PVF and candidate-validation modules
  • New network protocols, which are currently a PITA to do in Polkadot: one small change in one protocol requires changes in all networking protocols just to account for the new version
  • Parachains runtime changes so we can juggle current and new receipts at the same time.

Set of changes required for my proposal:

  • removing usage of the collator signature (which is a good thing to do even outside the scope of this issue) & converting the now-unused fields into some reserved space and a core_index in the candidate descriptor
  • A few isolated changes in the relay chain runtime to allow using UMP for both messages and commitments
  • A Cumulus parachain API to allow sending the commitments via the UMP transport

Timeline

The PVF changes and networking protocol changes take a lot of time to test and deploy in production, and will surface additional bugs that need fixing, further delaying the launch. Also, audits will take much longer since there would be more code to cover.

On top of that, parachain teams already expect to use Elastic Scaling (the Moonbeam roadmap, for example). The feature as it stands now is unlikely to be usable by most teams since their collator sets are open.

Why we should postpone the clean cut

IMO the cost of not delivering Elastic Scaling without limitations this year outweighs the concerns about doing my isolated hack. Time to market matters. Launching Elastic Scaling but not lifting the constraints on the MVP as soon as possible because we want to make things perfect doesn't have any value for parachain teams.

Introduce a new networking version that only supports this new commitment. After it is enabled on the relay chain, collators can start using it, but they can also continue using the old version. The validators will accept both versions.

The reclaimed space in the descriptor and the possibility to commit to additional things via the UMP transport layer provide enough room that we don't need to change CandidateDescriptor/CandidateCommitments in the near/medium-term future. IMO the best time to do it would be as part of the migration to JAM.

On top of that, while yes, this is not the cleanest solution, it is also not that hacky*) and:

  1. If we ever need to do a new version, this hack will go away with it - so we can think of it as an intermediate solution, until we really cannot help it and actually need a new version.
  2. This "hack" allows us to get rid of the much uglier hack we did in the paras inherent to account for the missing core index.

So while not the prettiest solution:

  • it should work
  • I don't see this backfiring
  • it is quite isolated
  • it allows us to get rid of an even uglier hack - one I am personally uneasy about and only accepted under the premise that it would go away soon.

*) In the end, why not think of this as a message being sent to the relay chain. The only really hacky part is that we did not properly account for any messages not being XCM, but the delimiter solution sounds good enough. The only code needing to look at this is candidate validation. Code impacted:

  1. The relay chain needs to stop processing messages at the delimiter.
  2. Candidate validation needs to verify the core index in the message after the delimiter.
  3. The parachain needs to send the core index message. (This message should obviously be an enum to allow for extensibility; see the sketch below.)

(3) is the most annoying, as it is user-facing, but the amount of breakage should stay the same:

If we ever end up doing the properly versioned thing (before JAM), then the reason will be that we introduced something else on top. Hence it would be another breaking change regardless.
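As a rough sketch of the extensible message from (3), with a purely hypothetical name and variant set; the point is just that an enum leaves room to grow:

```rust
use parity_scale_codec::{Decode, Encode};

// Hypothetical shape for the post-delimiter message. Starting with an enum
// means future commitments can be added as new variants without changing
// the transport convention again.
#[derive(Encode, Decode)]
enum UpwardCommitment {
    CoreIndex(u32),
    // Future variants go here.
}
```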

I agree that something we want to be mandatory eventually does not feel right as an extra message after a delimiter after the XCM messages. 😬 But OK, if that contract is well documented, it is not that bad.

PVF versioning (paritytech/polkadot-sdk#645), which needs some big changes in Cumulus as well as at least the node-side PVF and candidate-validation modules

This is like a max 1-day change in Cumulus and the validation modules. We are just exposing a new function and, on validation, checking which version exists.

  • New network protocols, which are currently a PITA to do in Polkadot: one small change in one protocol requires changes in all networking protocols just to account for the new version

Why do all protocols change when one of them changes?

IMO the cost of not delivering Elastic Scaling without limitations this year outweighs the concerns about doing my isolated hack. Time to market matters. Launching Elastic Scaling but not lifting the constraints on the MVP as soon as possible because we want to make things perfect doesn't have any value for parachain teams.

It was clear since the beginning of the year, or at least quite early on, that this change would be required. I still remember talking to you about this.

If we ever end up doing the properly versioned thing (before JAM), then the reason will be that we introduced something else on top. Hence it would be another breaking change regardless.

If we open this door, I don't see us ever closing it again, because the same arguments will be brought up again, etc.

PVF versioning (paritytech/polkadot-sdk#645), which needs some big changes in Cumulus as well as at least the node-side PVF and candidate-validation modules

This is like a max 1-day change in Cumulus and the validation modules. We are just exposing a new function and, on validation, checking which version exists.

Sounds very optimistic. I think we also need to consider the RFC, the implementation, writing the unit and zombienet tests, as well as deployment.

  • New network protocols, which are currently a PITA to do in Polkadot: one small change in one protocol requires changes in all networking protocols just to account for the new version

Why do all protocols change when one of them changes?

We have all the validation protocols under a single version, ValidationProtocol, which includes the statement distribution, bitfield and approval voting protocols. For this specific case we would have to change statement distribution so that messages include the new receipt with the core index. This leads to many changes on the node side. Implementing paritytech/polkadot-sdk#810 would have helped, but it was not done.

It was clear since the beginning of the year, or at least quite early on, that this change would be required. I still remember talking to you about this.

Yes, I remember discussing it in the context of extracting a CoreIndex from statements based on validator indices, which is what we currently have implemented for the MVP.

I've drafted a development and deployment plan with the repurposing of fields: https://hackmd.io/xrzVVZ_qSZemEIVdEV8cpA
I will come back with a similar plan for introducing a versioned CandidateCommitments structure as @bkchr suggested.

Second implementation option draft: https://hackmd.io/fWYvO8HQSFKjUlnpgwjcKw

The long route looks like it will take way too long to be feasible as a solution to elastic scaling, while the repurposing is really not that bad and should get the job done quickly enough. It also avoids needless breakage: any tooling that does not actually try to check collator signatures, for example, will continue to work even without any update.

The second document is just much more detailed than the first one...

feasible as a solution to elastic scaling

Not sure what you mean by this. There is no real time pressure on this.

One pragmatic benefit I see in the longer approach is that we can introduce relay chain hashes to the PVF arguments while we are at it:

Currently we use the relay chain storage root to identify relay chain parents. IMO changing this is not critical; the relay chain storage root is a good enough block identifier for now. However, having the relay parent header hash as a digest would make it much more convenient to fetch the relay parent for a given parachain block.

There is no real time pressure on this.

  • Without this, elastic scaling is not done.
  • It is a security vulnerability, arguably a minor one, but it still is.
  • As long as we don't have this, we have to keep the genuinely hacky code in the paras inherent processing.

Doing a binary-compatible upgrade path for as long as we can is not a bad idea: it is much less effort, avoids breakage (and issues), and can be live much faster. We can make part of the collator signature field specify the version and keep the remaining unused fields as "reserved", allowing for more binary-compatible upgrades.

The commitments part is actually purely optional and can stay that way: It is only used for verification of the core index in the descriptor. Only elastic scaling parachains need to bother at all.

Going fully versioned should, in my opinion, be an effort of its own, done when actually needed. We have a couple of things, e.g. what @skunert mentioned, which right now are nice to have but not crucial. We should go fully versioned for something like Polkadot 3.0, where we might need some actually impactful changes which cannot be introduced in a binary-compatible way; then we can also add all these other nice-to-haves and move the CoreIndex into its own commitment.

In JAM we should also make sure to have this properly versioned from the get-go.

Also, it is worth pointing out that with the binary-compatible change, we can have this rolled out with one release:

  1. Validators stop checking the signature; instead, they check whether it is all zeroes. If so, they expect the CoreIndex in the descriptor and the commitments.
  2. Collators either keep signing or set the signature to zero and provide the core index according to the protocol (see the sketch below).

Thus we can easily have fixed factor scaling fully delivered by EoY.
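A sketch of the validator-side check from step (1), with simplified stand-in types; treating an all-zero signature as the marker for the new format is the whole trick, and all names here are assumptions:

```rust
// Simplified stand-in for the real descriptor primitive.
struct Descriptor {
    signature: [u8; 64],
    // ... remaining fields, including the repurposed core index bytes.
}

enum Format {
    // Old format: verify the collator signature as today.
    Legacy,
    // New format: expect a CoreIndex in the descriptor and commitments.
    CoreIndexCommitment,
}

// Decide which format a descriptor uses, per step (1) above.
fn descriptor_format(d: &Descriptor) -> Format {
    if d.signature.iter().all(|&b| b == 0) {
        Format::CoreIndexCommitment
    } else {
        Format::Legacy
    }
}
```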

  • Without this, elastic scaling is not done.

But it is being worked on. This is no argument at all.

  • It is a security vulnerability, arguably a minor one, but it still is.

Yeah, like not having slashing enabled for all the things, but the stuff is being worked on.

Thus we can easily have fixed factor scaling fully delivered by EoY.

We can still achieve it, and even if it is not achieved, the world will not end.

Also, it is worth pointing out that with the binary-compatible change, we can have this rolled out with one release:

You are basically proposing a hard fork, without calling it a hard fork.

Update: we discussed the two approaches offline and decided that the implementation will be based on the initial proposal.