Choice of VSS to achieve performance, low data cost, and fault-tolerance
joebebel opened this issue · 2 comments
The fundamental problem is described very well in the DKG in the Wild paper: "the difficulty of differentiating between a slow node and a faulty node in the asynchronous setting are the primary sources of the added complexity."
Basically a slow node (or a faulty node that comes back online) must be able to complete the DKG, however, if the other nodes have already signed "ready", then it's difficult to guarantee share distribution for the slow/faulty node. The DKG in the Wild paper solves this because when HybridVSS succeeds, the slow nodes are guaranteed to recover because even if the dealer goes offline, everyone else can share to them.
I still think using univariate polynomials is a good idea because HybridVSS is just pretty expensive by itself. But we have to figure out exactly how to do it.
Just to enumerate some options:
- Post everyone's shares in some kind of verifiable-but-encrypted form "on chain". Optimistically, if the shares are only 48 bytes and there are 10,000 shares, and say we need to run 50 VSS instances, that's about 2.4MB chain storage per DKG. Not sure if that estimate is really realistic though. The "verifiable" part might make it more expensive. Maybe the data amount is not prohibitively expensive, but not ideal either.
- Keep trying to patch the univariate scheme to work cryptographically. I really think this should be possible, the challenge is in the details. Maybe there's suboptions for option (2), let's say (2a) and (2b). (2a) is to get consensus on which nodes complete each VSS (and overall DKG) instead of running each VSS separately - the way to make it work is maybe to not treat the VSS instances as independent (because then the faulty nodes might differ between VSS instances) but to also achieve consensus on what nodes have finished common sets of VSS. (2b) is to encrypt the shares in some publicly verifiable way (further investigation needed)
- Maybe use cryptoeconomics instead of verification. That is, the dealer distributes a big blob of everyone's shares (encrypted individually) and each node has no hope of verifying that other's shares are valid, but if someone else's shares turn out invalid, this fact can be verifiably revealed later (even after VSS is complete) and the dealer can be slashed
I actually rather like option 3 now that I consider it more. It uses very little on chain data (more or less the protocol described in the call on Tuesday), has low cryptographic and computational complexity, and particularly cheap in the optimistic case where no one tries to cheat and there are few/no network failures.
The scalability of (1) bothers me. it wastes a lot of resources to handle a non-optimistic case (bad behavior by the dealer) and the scalability of number of shares is itself part of the security of the system (a larger number of shares and therefore increased granularity/resolution can help the security, particularly if there are many small weight nodes in the system)
The cost of running VSS over Tendermint is probably not worth the benefits. In particular the encrypted shares need to be checked for consistency by everyone, which is not only difficult but also not cheap computationally. Not impossible to make it work out but there is a straightforward approach using AVSS which can be made fast, so that seems like the better option.
Cost mitigation measures can be implemented in the future, so the plan is to go with VSS over Tendermint.