ethereum/consensus-specs

EIP-7594: Ask for recommandation about sampling

nalepae opened this issue · 4 comments

Let's define:

  • c as the number of columns (currently, 128)
  • q as the number of columns a node should custody (with a current minimum of 4)
  • s as the number of columns a node should sample (currently, 16)

According to the specifications:

If a node already has a column because of custody, it is not required to send out queries for that column.

We can consider 4 scenarios:

Scenario A

Scenario A-1: The s columns to be sampled are chosen randomly from all available columns. If the node already custodies some of the columns to be sampled, these columns are then excluded from the sample set.

  • Case 1: 0 <= q < s: The node will sample between s - q and s columns.
  • Case 2: s <= q <= c: The node will sample between 0 and c - q columns. Particularly, if the s columns to be sampled are a subset of the q custody columns, then the node won't sample any columns.

Scenario A-2: The node's custody columns are excluded before selecting the s columns for sampling.

  • Case 1: 0 <= q < c - s: The node will sample exactly s columns.
  • Case 2: c - s <= q <= c: The node will sample exactly c - q columns.

Scenario B (hypothesis: c is an even number)

For scenario B, we introduce the following constraint: q + s <= c/2.
This constraint is based on the fact that when q + s = c/2, we can already reconstruct all the columns when sampling is successful. Therefore, having s > c/2 - q is unnecessary.

Scenario B-1: The s columns to be sampled are chosen randomly from all available columns. If the node already custodies some of the columns to be sampled, these columns are then excluded from the sample set.

  • Case 1: 0 <= q < s: The node will sample between s - q and s columns.
  • Case 2: s <= q < c/2: The node will sample between 0 and c/2 - q columns. Particularly, if the s columns to be sampled are a subset of the q custody columns, then the node won't sample any columns.
  • Case 3: c/2 <= q <= c: The node won't sample any column.

Scenario B-2: The node's custody columns are excluded before selecting the s columns for sampling.

  • Case 1: 0 <= q < c/2 - s: The node will sample exactly s columns.
  • Case 2: c/2 - s <= q < c/2: The node will sample exactly c/2 - q columns.
  • Case 3: c/2 <= q <= c: The node won't sample any column.

Advantage of scenarios A: The selection of columns to sample is independent of the node's custody set.
Disadvantages of this scenarios A: The number of columns to be sampled is uncertain and could be zero in Case 2.

Advantages of this scenarios B: The number of columns to be sampled is clearly determined by the values of c, q, and s, and will only be zero when q = c (for supernodes) for B-1 or when q >= c/2 for B-2.
Disadvantages of this scenarios B: The selection of columns to sample is not independent of the node's custody set.

Currently, Prysm implements scenario B-2 on its peerDAS branch.

Question: Are there any recommendations or warnings regarding the use of any of the described scenarios?

I believe that scenarios A (both A-1 and A-2) present significant challenges for nodes where q is very close to, but not exactly, c.

For example, if q = c - 1 and the only non custodied column is chosen for sampling (which is always the case in A-2 and may occur in A-1), the success of this sampling will depend entirely on the ability to sample the only remaining non custodied column on the first attempt. In this situation, lossyDAS/incrementalDAS won't be of any help because, if the sampling of this column fails, there are no other columns to attempt in subsequent sampling rounds.

As the value of q increases, it becomes more likely that sampling will declare a block unavailable, because each remaining column to sample carries a significant burden in the event of sampling failure.

This issue does not arise in scenarios B-1 and B-2.

@nalepae ,apologies for the late reply.

I think an important thing to highlight here is the difference between the role of custody and the role of sampling, and the different assumptions around public deterministic assignment and local random (and partially hidden) selection. I was writing about this in detail in my FullDAS writeup. The part about the differences is valid also here, in PeerDAS.

Another important element is how you can count with reconstruction. If you yourself can reconstruct, the data was not withheld (at least not from you), so you can regard it as available. Basically, you have it all, like In the current system. That is, you have certainty, which is better than the probabilistic guarantees a sampling can give you.


With that premise, regarding Scenarios A/B:
If I understand it correctly the difference is how you count with reconstruction.
If you can reconstruct the data, the data is available by definition. So one might argue that q + s > c/2 does not make sense. However, note that:

  • that’s just from the pure sampling point of view, and based on that PoV, we actually define small q and s values, well below c/2.
  • we increase q for another reason, which is to serve sample queries. This has the added benefit of having deterministic guarantees, which might make sense in nodes with high stake.
  • one can also increase s as part of IncrementalDAS, but than there are also losses. Still, it is true that you can stop once you can reconstruct (but at that point you could have all columns you need anyway).

Regarding A-1/B-1 vs. A-2/B-2:
As I understand, the -2 version tries to select the whole sample excluding the custody columns. That I think works as well, and with the same s I think it gives you slightly higher guarantees. But remember that your custody interest is publicly visible and easier to manipulate, so I wouldn’t reduce s.

The part I don’t understand is why you think that already having some of the s columns from other sources (gossipsub topics of the partially overlapping q custody columns) is a problem. If you happen to have some from gossipsub, you only ask for the rest with sampling req/resp. Your probabilistic guarantee is based in the erasure coding and the independence assumptions, i.e. that there was no targeted disclosure based on your sample selection. Since your sample selection is independent of your custody selection, this is not compromised, even if it overlaps.

Taking the case of q=c-1, I don't fully understand your point. At that level, you would be able to reconstruct, and you would be sure that the data was published, at least to you. It is true that you might still wonder whether it was published to the network. But at that point you would see enough network traffic and send it to enough peers, just because of gossipsub, that it would be difficult to withhold the data from the rest of the network. Success of the sampling does not depend on the remaining column, because you can reconstruct it.

Also note that you can always try retrieving the same column again, from the same node or from others. You can also use IncrementalDAS and retry missing columns at the same time.

Overall, I would use the "X-1" type selection (A-1 or B-1), because I think that's what gives you a "uniform" level of security. Then, If I could reconstruct, I would stop the sampling (no need to actually do the reconstruct for the sake of sampling, but we ask for reconstruction to serve columns and help the system).

This does not mean that other options are wrong. I think you can use B-2 without negative consequences. You do slightly more work than a node using A-1, and you also get enough guarantees (and in some cases even a bit more) with B-2.

Thank you for your clear response!

Taking the case of q=c-1, I don't fully understand your point. At that level, you would be able to reconstruct ...

I agree with you. That's why I later introduced the B-1/B-2 scenarios.