nucypher/nucypher

Designs for fault-tolerant DKG initialization


Idea 1: Optimistic Aggregation

Transcript phase

  • Transcript phase is the same, but with some additional buffer B of nodes also participating. This means that transcripts will be of size N + B participants.
  • Transcript phase ends when either:
    • N+B transcripts are received,
    • or, at least N + b transcripts are received within some transcript_window, where b is a protocol parameter and 1 <= b <= B.
  • Only nodes that submitted transcripts in this phase will continue to participate.
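The two termination conditions above can be sketched as follows (a minimal sketch; `N`, `B`, `b`, and `transcript_window` are the protocol parameters from the text, not actual nucypher identifiers):

```python
def transcript_phase_complete(num_received: int, elapsed: int,
                              N: int, B: int, b: int,
                              transcript_window: int) -> bool:
    """Phase ends when all N+B transcripts arrive, or when at least
    N+b have arrived and the transcript_window has elapsed."""
    if num_received >= N + B:
        return True
    return elapsed >= transcript_window and num_received >= N + b
```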

Aggregation phase

  • Each participating node i verifies the received transcripts and produces a local set I_i of invalid transcripts. Note that all honest nodes should arrive at the same set I_i.
  • Any participant can act as a "proposer" and produce an aggregate transcript according to their I_i, so the aggregated transcript will only include N valid transcripts. Note that, as a requisite of a successful transcript phase, there are between N+b and N+B valid transcripts available. There's no need for consensus on which N-size subset of the valid transcripts is used, as long as they're valid. The proposer submits the aggregate transcript (with an indication of which transcripts were included), and the set I_i.
  • Aggregation phase ends when an aggregate is submitted or when the ritual times out.
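A hypothetical sketch of the proposer's role described above, assuming transcript objects with a `node` attribute and an externally supplied `verify` predicate (all names illustrative, not nucypher APIs):

```python
def propose_aggregate(transcripts, N, verify):
    # The local set I_i of invalid transcripts; identical for all honest nodes.
    invalid = {t.node for t in transcripts if not verify(t)}
    valid = [t for t in transcripts if t.node not in invalid]
    # The transcript phase guarantees between N+b and N+B valid transcripts.
    assert len(valid) >= N, "not enough valid transcripts"
    # Any N-size subset of the valid transcripts works; no consensus needed.
    chosen = valid[:N]
    return chosen, invalid
```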

Challenge phase

  • In the optimistic case, the aggregate will be correct and there are no more protocol actions. We can include an optimistic_window of time here to leave room for challenges. After this window, the ritual is considered successful.
  • In the pessimistic case, the aggregate is invalid and at least 1/2 of the original transcript proposers will flag it on-chain as invalid within the optimistic_window, in which case there's evidence to slash the aggregate proposer. The ritual is considered failed.

Idea 2: Time-based submission queue

Flow

  1. Every validator in the ritual cohort may put their transcript onchain, with a minimum of shares_num and a maximum of validators_num transcripts.
  2. The next stage is started after the minimum number of transcripts shares_num is reached. If shares_num is not reached within some time window, the ritual times out.
  3. Validators are assigned tr_idx based on their transcript submission order. Every validator has until the stage_start + (tr_idx + 1) * time_slot_length block to submit a transcript aggregate[1]. If they run out of time, they can still submit their aggregate[2].
  4. If at any point any validator disagrees with the posted transcript aggregate, they can post their version if they haven't done so yet. The disagreement is resolved by super-majority voting [3]. The challenge period is cp_len = time_slot_length * validators.len() long and starts at stage_start.
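The timetable above can be sketched as follows (a sketch under the assumption that slot tr_idx closes at stage_start + (tr_idx + 1) * time_slot_length; all names illustrative):

```python
def aggregate_deadline(stage_start: int, tr_idx: int,
                       time_slot_length: int) -> int:
    # Slot tr_idx spans blocks [stage_start + tr_idx * time_slot_length,
    # deadline); missing it does not forbid a later submission (see [2]).
    return stage_start + (tr_idx + 1) * time_slot_length

def challenge_period_end(stage_start: int, time_slot_length: int,
                         num_validators: int) -> int:
    # cp_len = time_slot_length * validators.len(), starting at stage_start.
    return stage_start + time_slot_length * num_validators
```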

Discussion

Number of transactions total: validators.len() transcript submissions + [1, validators.len()] aggregate submissions. In the latter, validators.len() could be replaced with any number of votes it takes to achieve a super-majority. In the best case scenario, a disagreement between one node and the rest of the cohort can be solved by only two votes against that node (3 tx total).

For validators, not competing for an eligible position (i.e., not posting their transcript on time) means missing out on DKG participation. Not posting a transcript aggregate still permits their participation in the ritual, but it puts their welfare at the mercy of other validators.

[1] Instead of posting the transcript aggregate, we may also post sha256(tr_agg), or the last x bytes of it. We don't need the exact aggregate onchain, as it can be reproduced from the transcripts.
[2] Adherence to "timeslots" is not a hard requirement; it's simply an optimization aimed at reducing the number of transactions made by validators.
[3] Validators who weren't selected (they didn't post their transcripts in time) can still participate in the transcript aggregate posting and voting.
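The commitment from footnote [1] can be sketched with the standard library (`tr_agg` is assumed to be the serialized aggregate bytes; names illustrative):

```python
import hashlib
from typing import Optional

def aggregate_commitment(tr_agg: bytes,
                         last_bytes: Optional[int] = None) -> bytes:
    # Post sha256(tr_agg), or only its last x bytes, instead of the full
    # aggregate; the aggregate itself is reproducible from the transcripts.
    digest = hashlib.sha256(tr_agg).digest()
    return digest if last_bytes is None else digest[-last_bytes:]
```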

Idea 3: P2P Signature Collection

Transcript phase

  • Same as for Idea 1 (Optimistic Aggregation)

Aggregation phase

  • Each participating node i verifies the received transcripts and produces a local set I_i of invalid transcripts. Note that all honest nodes should arrive at the same set I_i.
  • There's a pre-defined window of time for each participant to act as "aggregate proposer" (similar to @piotr-roslaniec's idea above). During this window, the proposer will produce an aggregate transcript according to their I_i, so the aggregated transcript will only include N valid transcripts. Note that, as a requisite of a successful transcript phase, there are between N+b and N+B valid transcripts available. There's no need for consensus on which N-size subset of the valid transcripts is used, as long as they're valid.
  • The proposer presents the aggregate to each selected participant and collects their signature as approval.
  • The proposer submits the aggregate transcript (with an indication of what transcripts were included), the collection of signatures, and the set I_i. The contract will only accept the aggregate if N valid signatures are presented.
  • Aggregation phase ends when an aggregate is submitted or when the ritual times out.
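The contract-side acceptance rule above can be sketched as follows (`verify_sig` is a placeholder for whatever signature scheme the participants use; names illustrative):

```python
def accept_aggregate(aggregate: bytes, signatures, participants,
                     N: int, verify_sig) -> bool:
    # Count each participant at most once, and only if their signature
    # over this aggregate verifies.
    valid_signers = {signer for signer, sig in signatures
                     if signer in participants
                     and verify_sig(signer, aggregate, sig)}
    return len(valid_signers) >= N
```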

Note how this is a hybrid approach that mixes on-chain and off-chain phases. As opposed to Ideas 1 and 2, which are optimistic and therefore require an additional challenge window, the off-chain phase ensures that the ritual is final at the end of the aggregation phase, either succeeding or failing with a timeout.

It might be helpful to describe the fault scenarios against which tolerance is required. Obviously when the transcript aggregate is finalized, we regard this as byzantine-fault tolerant (at least in terms of its crypto-economics).

So we're talking about a "crash fault" (insofar as it needs to be tolerant for downtime but not necessarily protocol discoordination) at aggregation time, right? Are there other crash faults against which tolerance is required? Or other phases?

What are the fault scenarios most in mind? (Stemming from the recent failures and others like them that are easy to imagine)

Obviously when the transcript aggregate is finalized, we regard this as byzantine-fault tolerant (at least in terms of its crypto-economics).

Correct.

It might be helpful to describe the fault scenarios against which tolerance is required. (...)
So we're talking about a "crash fault" (insofar as it needs to be tolerant for downtime but not necessarily protocol discoordination) at aggregation time, right? Are there other crash faults against which tolerance is required? Or other phases?
What are the fault scenarios most in mind? (Stemming from the recent failures and others like them that are easy to imagine)

Potential fault scenarios:

  • Not enough transcripts to continue to phase 2, either due to missing or invalid transcripts.
  • Missing or invalid aggregated transcript

In general, missing transcripts are due to either client bugs, operator misconfiguration, or RPC endpoint faults, while invalid transcripts are not something we've observed; they can only happen due to client bugs (very unlikely) or malicious operators.

Time-based submission queue

  1. Every validator in the ritual cohort may put their transcript onchain, with a maximum of shares_num transcripts.
  2. The next stage is started after the minimum number of transcripts shares_num is added. If shares_num is not reached within some time window, the ritual times out.

@piotr-roslaniec I think the required number of transcripts should be greater than shares_num. If not, then there's no recovery possible in the event of a fault during the aggregation phase. With this change, round 1 would be the same as Ideas 1 and 3.

Actually, I think the Time-based queue (Idea 2) is very similar to Idea 1. Here's a refined proposal taking elements from both.


Idea 4: Optimistic Ordered Aggregation

Transcript phase

Similar to Idea 1 (that is, with B additional participants and parameters b and transcript_window to make the phase more flexible; transcripts will be of size N + B participants), with the only addition that participants are ordered based on transcript posting order (as in Idea 2).

Aggregation phase

Similar to Idea 1, with the exception that there is a pre-defined criterion to choose the N-size subset of transcripts that will be aggregated (e.g., the first N valid transcripts). Optionally, we can use the concept of submission windows from Idea 2 to regulate who the proposer is, but I don't see this as necessary (we can achieve a similar effect by randomizing the time to submit the aggregate, which we recently added). Another potential variant is to submit a hash of the aggregate instead, as proposed in Idea 2.

Since there's consensus on the method to construct the aggregate transcript, then under the honest majority assumption there will be consensus on whether the expected aggregate transcript exists and, in the positive case, what it should look like. This is important for the next phase.

Challenge phase

  • Same as Idea 1 for the optimistic case (that is, there are no objections against the aggregate, and after some time, the ritual is considered successful).
  • In the pessimistic case, the aggregate is deemed invalid by some participants. The original transcript proposers (a.k.a. "challengers") can flag it on-chain as invalid within the optimistic_window, proposing at the same time the correct aggregate (which, again, under the honest majority assumption, should be the same for all honest voters). As soon as someone challenges the aggregate, all participants are required to challenge it as well. One of three things can happen:
    • A super-majority of nodes agree that the aggregate was invalid and propose the same correct aggregate. Ritual succeeds with the new aggregate.
    • A super-majority of nodes ratify the original aggregate (implying that the first challenge is incorrect). Ritual succeeds with the original aggregate.
    • Otherwise, ritual fails.
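A toy resolution of this challenge phase, assuming a 2/3 super-majority over validators_num (the fraction is an assumption; the thread only says "super-majority") and votes mapping each participant to the aggregate hash they endorse (names illustrative):

```python
from collections import Counter

def resolve_challenge(votes, validators_num: int):
    # votes: participant -> hash of the aggregate they endorse.
    if not votes:
        return ("failed", None)
    tally = Counter(votes.values())
    winner, count = tally.most_common(1)[0]
    if 3 * count >= 2 * validators_num:   # assumed 2/3 super-majority
        return ("success", winner)        # original or corrected aggregate
    return ("failed", None)
```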

Failure scenarios

In what circumstances can the protocol fail?

  • Less than N + b transcripts received in phase 1, so ritual times out.
  • Enough transcripts submitted in phase 1, but no aggregation in phase 2, so ritual times out.
  • Enough transcripts submitted in phase 1, more than one aggregation in phase 2, and super-majority not achieved for a candidate aggregation in phase 3.

@cygnusv you're right, it should be validators_num. What I meant to say is that until shares_num is reached, we may not proceed to the next stage. I updated the description.

Optionally, we can use the concept of submissions windows from Idea 2 to regulate who's the proposer, but I don't see this as necessary (we can achieve a similar effect by randomizing the time to submit the aggregate, which we recently added).

Randomization also sounds good to me. In principle, any selection method that is fair and transparent qualifies IMHO.

Another potential variant is to submit a hash of the aggregate instead, as proposed in Idea 2.

By the way, I don't see any reason not to use the aggregate hash instead of the real aggregate, do you? Refreshing and recovery don't depend on it either.

As soon as someone challenges the aggregate, all participants are required to challenge it as well.

So the difference between this part (Idea 2 vs Idea 4) is that all participants submit their vote right away, instead of timing their vote sequentially. Now that I think about it, Idea 2 makes this process more time-sensitive and complicated, thus prone to error.

And just to make sure we have a shared understanding of what super-majority means: It's a super-majority in the traditional sense, yes, but the "denominator" is the number of all validators selected for the DKG (validators_num, not shares_num).
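A worked example of the denominator point above, taking "super-majority in the traditional sense" to mean 2/3 (the exact fraction is my assumption):

```python
import math

def supermajority_threshold(validators_num: int) -> int:
    # Denominator is validators_num (all validators selected for the DKG),
    # not shares_num.
    return math.ceil(2 * validators_num / 3)
```

For instance, with validators_num = 10 and shares_num = 7, seven votes are required rather than five.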

@piotr-roslaniec agreed with all your points.

As soon as someone challenges the aggregate, all participants are required to challenge it as well. One of three things can happen:

  • A super-majority of nodes agree that the aggregate was invalid and propose the same correct aggregate. Ritual succeeds with the new aggregate.
  • A super-majority of nodes ratify the original aggregate (implying that the first challenge is incorrect). Ritual succeeds with the original aggregate.
  • Otherwise, ritual fails.

In the pessimistic case, the aggregate is deemed invalid by some participants

If the proposed aggregate was indeed invalid, but was posted by a node (node_i) that is part of the cohort for that aggregate, and a supermajority subsequently agrees on the corrected aggregate, is there any concern about node_i still being in the cohort? Of course, node_i can be replaced down the road after ritual completion, but just wondering. In other words, would this scenario have an effect on when node replacement should potentially occur for a completed DKG ritual that has "questionable" members whom a supermajority overruled?

It's great that the ritual goes through (improves our fault tolerance), but given that we made it through the ritual with questionable members, is there something we should be cognizant of regarding the longevity of the ritual cohort?

Same as idea 1 for the optimistic case (that is, there's no objections against the aggregate, and after some time, the ritual is considered successful)

Any initial thoughts/musings on the length of the challenge phase? This may have an impact on node event scanning interval.

is there any concern about node_i still being in the cohort

I think it's a valid concern and we could "trim" the cohort down to shares_num of supermajority voters or fail if their number is smaller than shares_num.

Any initial thoughts/musings on the length of the challenge phase?

I didn't think about it too much but an order of 20 blocks seemed reasonable to me (a couple of minutes essentially).

One tricky aspect of optimistic challenge windows is that we need to ensure that the node has sufficient time to realize that it needs to challenge when required, and enough time for the tx to go through, since this challenge would be an important tx. Given the relation to the current node event scanning interval (currently 2 mins.), this window may need to be >= 4 mins. or so, or else there may not be sufficient time for a node to issue a challenge if needed. Unless, of course, we want to modify the event scanning interval, but this may have implications on the volume of RPC calls.

Just something to think about and not something we need a definitive value for right now.

I did like the definitiveness of having nodes submit aggregations, although I recognize the additional cost to nodes for transactions. The equivalent of that here would be to have nodes submit an aggregate hash (an explicit signed endorsement of that aggregate), until a supermajority of them agree on a particular hash, then the ritual is complete. But, as mentioned before, the tradeoff here is the additional cost of transactions made by nodes.

If the proposed aggregate was indeed invalid, but was posted by a node (node_i), and that node is part of cohort for that aggregate, but then a supermajority subsequently agree/ratify on the corrected aggregate, is there any concern about node_i still being in the cohort?

Great question. I did think about this, but I see it as an edge case that requires taking into consideration the incentives in play. For starters, the main goal for the improved ritual initiation protocol is to increase robustness against unresponsive nodes. While it does provide some robustness against malicious nodes, it doesn't achieve the same level.
There's no honest reason for a node to submit a correct transcript but a wrong aggregate. From a cryptoeconomic perspective, this is a very stupid move, since it implies putting on-chain direct evidence for slashing. For this reason, I prefer that we basically "ignore" this type of edge case and solve it differently.

Having said that, let's go back to your question: what should we do in this case? This is something that can even be detected at the smart contract level, so we have a lot of flexibility here, but with important trade-offs. On the one hand, we can choose to continue with the ritual and go with whatever the super-majority of nodes decide; the misbehaving node would have left evidence for future slashing, and the ritual would go on with a "questionable" member, as you put it, which shouldn't be much of an issue in a threshold cryptosystem. This is what I initially considered, and why it wasn't mentioned in the proposal. On the other hand, we can choose to cancel the ritual; this could be considered a "safer" choice, but it would introduce the possibility that a single misbehaving node can force ritual failure.

Regarding longevity, I agree that "questionable" members are the most immediate target for replacement.

Any initial thoughts/musings on the length of the challenge phase? This may have an impact on node event scanning interval.

I would leave it to be as long as the overall timeout left, but I guess we can leave some room for optimization.

I would leave it to be as long as the overall timeout left, but I guess we can leave some room for optimization.

I've been wondering about this generally. How performant do we care about ritual ceremonies being? Are we ok with a ritual potentially taking the entire timeout (currently 3 hrs, which I think was just an overestimation to allow for issues to be resolved)? Should it be more performant if certain use cases potentially need rituals created on the fly, ideally in the order of a few mins.?

Of course for the beta program it doesn't matter, although even then we are currently in the order of a few mins., but in the permissionless future does it being performant begin to matter more ...

I've been wondering about this generally. How performant do we care about ritual ceremonies being? Are we ok with it potentially taking the entire timeout (currently 3hrs)? Should it be more performant if certain use cases potentially need to have them created on the fly, ideally in the order of a few mins.?

The waiting time is inherent to optimistic protocols, but I guess we can introduce an option to make it not optimistic (i.e., all nodes are requested to vote on aggregation), as long as the adopter pays for it.