polkadot-fellows/RFCs

Mild gossip reform

Opened this issue · 2 comments

We've always kinda winged our gossip protocols, so they're both heavier than required, and maybe do not deliver the desired resilience.

We've different gossip in different places, but an important one being used looks like a grid plus a spamy layer. If I recall, the spamy layer always fires, even if when grid works, which sounds wasteful.

At an intuitive theory level @chenda-w3f and I have now reached some understanding of what should be the relevant properties. In brief..

  • randomized typologies improve resistance against eclipse attacks and similar, with
  • a locally randomized topology being somewhat better than a globally randomized one,
  • but local randomization gives a random regular graph with overlaps which make the gossip less efficient,
  • It's supposedly not much less efficient, but every time he quotes me numbers then they sound quite noticeable.
  • If you rerandomize a topology by global randomness then faster helps more I guess.

In asymptotics,if we've n nodes then a locally randomized topology with valency O(n^{1/k}) for k=2,3 should have "small" constant diameter d, but small is not d=2,3 like you get from k=2,3 respectively with a deterministic 2d or 3d grid topology. I've now forgotten what Chen said O(log n) yields, but maybe diameter O(log n).

We're largely lacking in concrete numbers here, although sometimes Chen reported various ones from the literature.

We suspect the diameter two given by the 2d grid topology winds up being overkill, so we'd suggest polkadot pay some latency in exchange for a less spamy topology. We propose an experiment in which the gossip topology be defined by the union of two topologies:

  1. We define a 3d grid in which every validator sends to 3 n^{1/3} = 30 other validators. We deterministically resort all validators into this grid based upon a context-entropy under which the messages gets sent. Although deterministic, we'll update the context-entropy faster for more sensitive protocols, so this helps more against eclipse there:

    • Approval system messages would set context-entropy to be the relay chain VRF output for the relay chain block being discussed.
    • Initially grandpa would set the context-entropy to be the epoch randomness, but maybe we could select something faster, maybe post-BEEFY.
  2. We send to n^{1/2} = 32ish random other validators selected using system randomness.

At present, our 2d grid sends to 2 n^{1/2} = 64 validators and our spammy layers sends to even more, so this reduces the total message attempts by whatever the spammy layer does.

We'll be adding hops here though so TTL shall matter more, so we should discuss how TTL works in gossip now. I donno yet..

I'd expect two development costs here: Any TTL changes of course, required even for experimental work. All the plumbing required by the new context-entropy input. I think context-entropy being constant makes sense for testing though, which simplifies this considerably.

We'll have @chenda-w3f run some theory simulations of this topology before anyone starts implementation work. It's be helpful if someone could comment here on what the existing topologies really looks like so Chen can compare. We think more about how much context-entropy input helps too.

cc @rphmeier @eskimor @sandreim @heversonbr

The existing topologies are a simple 2D grid, where the originator of a message broadcasts on both the X and Y axes and the 1st hop recipient rebroadcasts along the other axis. The topology is fixed per-session (4 hours) and is globally randomized by the BABE random beacon value of the session.

The only exception is in backing, where validators initially coordinate directly within their group and only circulate sufficiently attested candidates along the grid.

Large data is typically requested via request/response protocols, with only small data circulated via gossip.

I learned later that currently the extra non-grid paths do not unify with the 2d grid, and it sounded like the TTL stops at two hops. This maybe kinda fragile, but sounds less spammy than what I'd been lead to believe.

We'll chat about this more at the retreat. I've not yet talked chen into doing any simulations that give more concrete numbers.