barryWhiteHat/roll_up

How to provide data availability guarantees inside the EVM

barryWhiteHat opened this issue · 22 comments

Due to #3 it's quite expensive to add a public input to the snark, so we cannot provide data availability that way. We need to provide it somehow in Minimal Viable Roll_up (MVR).

So when a proof happens we know that noTx leaves have changed. What we do is require the prover to prove in the EVM that the 1000 leaves are different between the two trees. This currently costs tree_depth * 2 sha256 operations per leaf. sha256 costs 72 gas per word, so we would need to pay

>>> gasCost = 72 
>>> tree_depth = 32
>>> noTx = 1000
>>> gasCost * tree_depth * noTx
2304000
>>> gasCost * tree_depth * noTx * 2
4608000

Which is roughly 4,600 gas per tx. It's not great but it should work. Though it means that if we move to Pedersen commitments we would need to implement them very efficiently, or else our data availability guarantees would limit tx per second.

This also works out as much cheaper than passing public inputs to the snark verification.

The process is:

  • people submit the leaves they want included in the tree to an on-chain contract
  • zksnark proof updates the merkle root, which includes all these leaves
  • repeat etc.

So the 'delta' is the set of new leaves between each SNARK proof. The circuit takes these leaves as private inputs, proves that the hash of all of the leaves in sequence equals the one calculated on-chain, then adds them all into the tree and calculates the new root.

So:

def circuit(old_root, new_root, leafs_sequence_hash, leafs):
    # public inputs: old_root, new_root, leafs_sequence_hash
    # private input: leafs
    leaf_hash = 0
    for leaf in leafs:
        leaf_hash = H(leaf_hash, leaf)
    assert leaf_hash == leafs_sequence_hash
    root = merkle_append(old_root, leafs)
    assert root == new_root

This verifies that the leaves provided to the circuit match the ones added on-chain, without having to provide the leaves as public inputs. It also attests that the Merkle root has been updated specifically to include those leaves. To insert a leaf, the person calls an on-chain contract and provides it with the leaf hashes; this could be done one at a time or in batches.
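A minimal sketch of what the accumulator side of that on-chain contract could look like (plain Python standing in for the contract; sha256 as the sequence hash, the 32-byte zero initialiser, and all names are assumptions, not a spec):

from hashlib import sha256

def H(left, right):
    # The same two-to-one hash the circuit uses for the sequence hash.
    return sha256(left + right).digest()

class LeafQueue:
    # Toy stand-in for the on-chain contract that collects leaves.
    def __init__(self):
        self.sequence_hash = b"\x00" * 32  # mirrors leaf_hash = 0 in the circuit

    def submit_leaves(self, leaves):
        # Users (or a batcher) call this with the 32-byte leaves they want included.
        for leaf in leaves:
            self.sequence_hash = H(self.sequence_hash, leaf)

The accumulated sequence_hash is then the single leaf-related public input handed to the snark verifier.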

But that's for cheaper inserts into the Merkle tree only, with aggregation of inserts via a zkSNARK proof - which is different from what roll_up is, right?

To recap some earlier discussion.

Data availability can be guaranteed by splitting it between the 'tx pool' and the 'prover'; this also provides an anti-censorship mechanism.

There is a consensus of nodes which represent the 'tx pool'; they guarantee data availability.

If I want a TX included in a block, I must first prove to the 'tx pool' nodes that it's a valid transaction for a given merkle tree root (e.g. the current block).

The tx pool signs this information, confirming they have received proof that it's a valid transaction and hold all the data necessary for the prover to include it in a block.

I then take this information to the prover and say 'I have proof, from the tx pool, that I have a valid transaction', without revealing any details about that transaction to the prover.

The prover then gives me a token/signature which guarantees they will include this transaction in the block, however they don't know any information about that transaction (yet).

For the next block, the prover then provides all the guarantees for transactions to the tx pool nodes - which reply with the details of the transactions.

This provides a mechanism which splits the responsibility for data availability and proving transactions: the prover must commit to what it will include without knowing what it is, while being assured that all the information it needs will be available - so censoring transactions would require collusion between the tx pool nodes and the prover.

This is a half-baked idea, but is worth investigating.
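For concreteness, a rough sketch of that message flow (every type, field and step here is illustrative only, not a spec):

from dataclasses import dataclass

@dataclass
class PoolReceipt:
    tx_commitment: bytes   # commitment to the tx, not the tx itself
    merkle_root: bytes     # root the tx was validated against
    pool_signature: bytes  # tx pool attests: "valid, and we hold the data"

@dataclass
class InclusionToken:
    tx_commitment: bytes
    prover_signature: bytes  # prover commits to inclusion without seeing contents

# 1. user -> tx pool:  full tx plus proof of validity against the current root
# 2. tx pool -> user:  PoolReceipt (the pool now guarantees data availability)
# 3. user -> prover:   PoolReceipt only, revealing nothing about the tx
# 4. prover -> user:   InclusionToken, committing to include the unseen tx
# 5. prover -> tx pool (next block): presents tokens, receives the tx data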

This provides a mechanism which splits the responsibility for data availability and proving transactions: the prover must commit to what it will include without knowing what it is, while being assured that all the information it needs will be available - so censoring transactions would require collusion between the tx pool nodes and the prover.

If whatever consensus mechanism is used involves every tx pool node seeing the plain transaction, then it should be fairly easy for a prover to also run a tx pool node in order to view the plain transactions submitted by users, and as a result the prover would still be able to censor effectively. An alternative could involve DFINITY-style random sampling, such that only a subset of tx pool nodes check the validity of a user's transaction, which might make it more difficult for a prover to see all users' plain transactions.

This would be a UX hit for users, but instead of relying on a tx pool, maybe users could generate a SNARK proving the validity of a transaction without revealing it to the prover. The prover would then provide a signature guaranteeing the inclusion of the valid transaction without knowing its contents until it actually receives the plain transaction later on for inclusion in the Merkle tree.

Good insight.

This would be a UX hit for users, but instead of relying on a tx pool, maybe users could generate a SNARK proving the validity of a transaction without revealing it to the prover. The prover would then provide a signature guaranteeing the inclusion of the valid transaction without knowing its contents until it actually receives the plain transaction later on for inclusion in the Merkle tree.

Yes, this was a line of thinking I had too.

But then the 'data availability' problem rears its head again, where if the prover commits to including a transaction - you need to guarantee that it has all the data available for it and the transaction owner can't withhold the information. At least - if the prover will be penalised for not including that data, you shouldn't be able to deliberately withhold information.

e.g. we want some kind of game theoretic commitment by the prover, where it's penalised for not following the rules of the game (where it must prove everything it's committed to) ?

IMO we should provide strong data availability guarantees. For me this means being able to recover from failure modes.

Ethereum provides this by saying if data ever becomes unavailable then the chain is invalid and we will roll back to the last state where data is available and continue mining from there.

Our in-EVM approaches to data availability provide this by relying on Ethereum to provide these guarantees.

This provides a mechanism which splits the responsibility for data availability and proving transactions: the prover must commit to what it will include without knowing what it is, while being assured that all the information it needs will be available - so censoring transactions would require collusion between the tx pool nodes and the prover.

However this approach does not allow this. If the tx pool is malicious they can lock all users' funds by making data unavailable. Is this a proposal for complete data availability guarantees or just one layer of it?

So when a proof happens we know that noTx leaves have changed. What we do is require the prover to prove in the EVM that the 1000 leaves are different between the two trees. This currently costs tree_depth * 2 sha256 operations per leaf. sha256 costs 72 gas per word, so we would need to pay

Where is the input for these sha256 operations coming from? From the tx data? A byte of Ethereum transaction data costs 68 gas (see Gtxdatanonzero), i.e. 2176 gas per word. tree_depth * 2 * cost_per_word == 139k gas per tx. Checking the Merkle root in the EVM would thus defeat the purpose of scaling.
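Spelling out the arithmetic behind that 139k figure in the same REPL style as above:

>>> gas_per_byte = 68            # Gtxdatanonzero
>>> gas_per_word = gas_per_byte * 32
>>> gas_per_word
2176
>>> tree_depth = 32
>>> tree_depth * 2 * gas_per_word
139264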

Even if this can be somehow optimized to only posting the addresses and hashes of the leaves on mainchain, it's still around 5k gas per tx. A batched token transfer can already be achieved in EVM at 15k gas per tx, so the saving is not as radical, especially when the costs of zk-proof are considered.

Have you guys considered approaches from the link below (erasure coding, probabilistic availability checks by light clients)?
https://github.com/ethereum/research/wiki/A-note-on-data-availability-and-erasure-coding

We are kind of moving away from this proposal in favor of an idea from @HarryR

We pass all of the leaves that are updated, hash them together in the EVM, and then pass the result to the snark, which again hashes the new leaves and verifies that the result equals the public input from the EVM.

So in that case we would only need to pass the list of leaves to the EVM, which would cost noTx * cost_per_word == 2,176,000 gas. We also have to do 2046 hashes inside the EVM, which is feasible for sha256, but for our more efficient in-snark hash functions it gets more expensive.

But the problem here is that it increases the number of constraints required in the snark. For noTx = 1000 the number of hashes required is 2046, which makes 50k * 2046 ~= 100 million constraints, so we would probably have to limit noTx to around 100, which will limit our gains in terms of gas savings.
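Spelling out those numbers (the 50k constraints per sha256 is the rough figure used above, not a measurement):

>>> noTx = 1000
>>> cost_per_word = 68 * 32      # calldata cost of one 32-byte leaf
>>> noTx * cost_per_word
2176000
>>> hashes = 2046
>>> constraints_per_hash = 50000 # rough sha256 figure
>>> hashes * constraints_per_hash
102300000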

Even if this can be somehow optimized to only posting the addresses and hashes of the leaves on mainchain, it's still around 5k gas per tx.

Ah okay this makes it seem impossible to provide data availability guarantees inside the EVM. Maybe not impossible but a 3X optimization is less exciting.

So perhaps we have to do something similar to Plasma and allow people to exit on a stale state. Then we don't need to have the data available, but we do need exit time limits.

Have you guys considered approaches from the link below (erasure coding, probabilistic availability checks by light clients)?
https://github.com/ethereum/research/wiki/A-note-on-data-availability-and-erasure-coding

My thinking on erasure coding is that in this context we only have one chunk, so it turns into data compression. But because all of the data we have to publish are hashes, and therefore pseudorandom, we probably can't get decent compression of them. Though I am open to investigating this approach a little more.

Ideally we can construct a system in which users can go offline for prolonged periods of time and still reasonably expect their funds to be safe. Let's call it the long exit guarantee. Zk-SNARKs cover the most difficult part of it: no incorrect transactions can be posted. Now, hopefully we find a solution to the remaining challenge, data availability.

If I understand correctly, the stale-state exit approach would not have the long exit guarantee, because users must act immediately when they face data unavailability for their branch. Right?

A batched token transfer can already be achieved in EVM at 15k gas per tx

Can you give me some info on this? I want to check if the batched txs are unique.

Even if this can be somehow optimized to only posting the addresses and hashes of the leaves on mainchain, it's still around 5k gas per tx.

We can probably compress the leaves down from the current 32 bytes, so we can get below 5k gas per tx.

>>> tree_depth = 32
>>> len(bin(2**tree_depth - 1)[2:])   # bits needed for a leaf index
32
>>> 256 // 32
8

so that is 8 leaf addresses per 32 bytes, and the addresses are probably compressible on top of that.

I still think that a 1.5x to 3x improvement justifies MVP 1 having EVM-based data availability guarantees.

But let's talk about that at our next meeting #16 :)

Ideally we can construct a system in which users can go offline for prolonged periods of time and still reasonably expect their funds to be safe. Let's call it the long exit guarantee. Zk-SNARKs cover the most difficult part of it: no incorrect transactions can be posted. Now, hopefully we find a solution to the remaining challenge, data availability.

So let's move this to another issue to keep things a little cleaner here. #15 is for data availability outside the EVM, and this issue is for in-EVM data availability.

A batched token transfer can already be achieved in EVM at 15k gas per tx

Can you give me some info on this? I want to check if the batched txs are unique.

The scheme could look as follows. Transfer data contains compressed from, to, value, nonce and sig (~5k gas). The EVM verifies the signature (3.5k gas) and updates 2 words in storage (10k gas), one per account, each word containing the nonce and balance. One Ethereum transaction contains N transfers, just like a plasma snapp update tx (+22k overhead / N transfers). So it's probably about 20k gas overall, not 15k.
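Roughly, in REPL form (N = 50 is just an example batch size):

>>> data, sig, storage = 5000, 3500, 2 * 5000
>>> per_transfer = data + sig + storage
>>> per_transfer
18500
>>> N = 50
>>> per_transfer + 22000 // N    # amortised per-transaction overhead
18940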

The scheme could look as follows. Transfer data contains compressed from, to, value, nonce and sig (~5k gas)

What kind of compression do they use?

Just 4 bytes per account.

So the 'account' is the leaf offset? e.g. each transaction submitted on-chain in EVM is:

  • uint32 - from (leaf offset)
  • uint32 - to (leaf offset)
  • uint32 - value
  • uint32 - nonce

Then that data is hashed into a sequence, and the same hash sequence is verified in-snark?

With that form each tx is 128 bits, and it accurately describes the movement of funds from one account to another so you can calculate the balance of each account after each block. And you can pack two transactions into each EVM word.
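A minimal sketch of that packing, purely for illustration (function names and the exact field order are assumptions):

def pack_tx(frm, to, value, nonce):
    # Four uint32 fields -> one 128-bit integer.
    assert all(0 <= x < 2**32 for x in (frm, to, value, nonce))
    return (frm << 96) | (to << 64) | (value << 32) | nonce

def pack_word(tx_a, tx_b):
    # Two 128-bit transactions -> one 32-byte EVM word.
    return ((tx_a << 128) | tx_b).to_bytes(32, "big")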

So gas cost is:

((68 * 16) * n) + (hashCost * n) + snarkVerification + overheads

Then, you can pass the 'old_root' and 'new_root' parameters to the EVM too, and hash them, so the SNARK circuit only has 1 public input, keeping verification costs to a minimum.

I see what @barryWhiteHat means about the 128-bit security level: if every tx has 128 bits of tx info and a truncated 128-bit hash of the public key, then you can fit the whole leaf into 256 bits. The problem, though, is that the fast in-circuit hash algorithms don't work on 256 bits - you need to fit them in 253 bits.

But you could make the nonce 29 bits (max 536m txs per account). So say each tx is of the format:

  • uint32 from
  • uint32 to
  • uint32 value
  • uint29 nonce
  • uint128 pubkey hash

Then the on-chain costs become:

((68 * 32) * n) + (hashCost * n) + snarkVerification + overheads

And all on-chain tx data has enough info to fully re-create and verify the tree.
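For illustration, the 253-bit leaf layout above could be packed like this (the names are made up; only the bit widths come from the list above):

def pack_leaf(frm, to, value, nonce, pubkey_hash):
    # Field widths: 32 + 32 + 32 + 29 + 128 = 253 bits total.
    assert all(0 <= x < 2**32 for x in (frm, to, value))
    assert 0 <= nonce < 2**29 and 0 <= pubkey_hash < 2**128
    leaf = frm
    leaf = (leaf << 32) | to
    leaf = (leaf << 32) | value
    leaf = (leaf << 29) | nonce
    leaf = (leaf << 128) | pubkey_hash
    assert leaf < 2**253   # fits in a single ~254-bit field element
    return leaf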

Say snarkVerification costs 500k. With 100 txs the gas cost per-tx for snark verification is 5k.

n     tx input data cost   hash cost @ 5k   total (with snark)   per tx
10    19,200               50,000           569,200              56,920
50    108,800              250,000          858,800              17,176
100   217,600              500,000          1,217,600            12,176

So you can see that the larger the number of TXs included per block, the cheaper it gets. However, the cost is dominated by the 'cheap' hash cost - and maybe 5k per hash is too high an estimate; it might be a lot cheaper than that...

The signatures for the transactions don't need to be included on-chain. And the pubkey hash doesn't need to be either...

But it looks like zkSNARKs can do an account model similar to batched updates, cheaper than the EVM alone. If you use SHA256 as the on-chain hash function this becomes very cheap on-chain (much cheaper than batched EVM transactions) and the break-even threshold is reduced, at the expense of a snark circuit which takes much longer to prove.

So right now our gas cost for noTx transactions using Harry's data availability guarantees is

3500 * noTx + noTx * 2 * hashCost + snarkVerification

@gluk64 does that look correct to you?

Where we can optimize:

  1. noTx * 2 * hashCost by packing multiple addresses into a single sha256.
  2. The snark verification #2
  3. noTx * 2 * hashCost by compressing the leaves that we pass to the snark, removing bits at the end and just verifying that the start of the hash equals the result. We can do this as long as we don't go under 128 bits of security, I suppose. But I am not crazy about this approach.
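To make the formula above concrete, here it is evaluated with placeholder values (hashCost = 5k and snarkVerification = 500k are assumptions, not measurements):

>>> noTx, hashCost, snarkVerification = 100, 5000, 500000
>>> total = 3500 * noTx + noTx * 2 * hashCost + snarkVerification
>>> total
1850000
>>> total // noTx
18500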

Yes. Is 5k the expected cost of a hash with Pedersen commitments?

Do we actually need the pubkey in these hashes? Are they required from the point of view of data availability? If not, we can indeed cut the cost in half by hashing 2 transactions in 1 word.

Still, the cost of even 10k per tx is very high. Remember that SNARK proof generation is equally if not more expensive. And as shown above, a simple transfer under the same conditions can be implemented in pure EVM in under 20k gas. So what do you think?

Besides, Ethereum transaction data does not have the same data availability guarantees as EVM storage data. Any miner can safely discard all transactions prior to a block he believes is sufficiently irreversible. So the entire solution depends on the existence of altruistic archive nodes, willing to provide all the gigabytes of historic data upon request. I'm not sure such nodes exist even now...

Yes. Is 5k the expected cost of a hash with Pedersen commitments?

It depends on the implementation and how much I can optimize. I think 5k is optimistic. I would think more like 10k.

Do we actually need the pubkey in these hashes? Are they required from the point of view of data availability?

No, we do not include these. The user can calculate them from the transactions they send. When someone receives a payment they also get the actual leaf data; otherwise it does not count as a receipt.

Still, the cost of even 10k per tx is very high. Remember that SNARK proof generation is equally if not more expensive. And as shown above, a simple transfer under the same conditions can be implemented in pure EVM in under 20k gas. So what do you think?

Agreed, this is a major limitation. I am thinking about data availability outside the EVM; will post about it in a while. Not sure if we should move forward with EVM data availability. Would like to hear others weigh in @HarryR @yondonfu @PhABC

Besides, Ethereum transaction data does not have the same data availability guarantees as EVM storage data. Any miner can safely discard all transactions prior to a block he believes is sufficiently irreversible. So the entire solution depends on the existence of altruistic archive nodes, willing to provide all the gigabytes of historic data upon request. I'm not sure such nodes exist even now...

If they throw away the txs, how can anyone sync the chain? They need to verify signatures to calculate the current state?

If they throw away the txs, how can anyone sync the chain? They need to verify signatures to calculate the current state?

There is no need to verify the entire chain from genesis. All clients can have a trusted block number hardcoded and download a snapshot of the state. Parity and geth both support such a quick-sync mode, and I suspect that most new users do not download the full chain. And the absolute majority use light clients with Infura.

Closing in favor of #19 and #18