canvasxyz/canvas

Gossiplog - Sync max retries while syncing large number of entries

Closed this issue · 1 comment

Currently, if a node is very far behind another node and tries to sync, it will encounter a max sync retry error. This happens because of the 3-second sync timeout. I am curious why this timeout exists. In a previous issue, #295, I was told it was there so we wouldn't hold a lock for too long during the sync transaction. I assumed that referred to the ability to append/insert messages while a sync is in progress, so I ran some tests to check, and found that I was able to append during an ongoing sync. Unless I am missing something, that means the timeout isn't needed, and it prevents nodes that need more than 15s to catch up from ever syncing.
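To make the failure mode concrete: with a fixed per-attempt deadline, retrying never helps once a single sync pass over the backlog exceeds the deadline, because every attempt restarts from zero against the same deadline. A toy model of this (all numbers below are assumed for illustration, not measured from GossipLog):

```typescript
// Hypothetical model of the failure mode: a fixed per-sync deadline plus
// a bounded retry count. If one full pass needs longer than the deadline,
// the retry count is irrelevant, because each attempt hits the same wall.
const SYNC_TIMEOUT_MS = 3_000; // assumed per-attempt deadline
const MAX_RETRIES = 3;         // assumed retry bound
const MS_PER_ENTRY = 1;        // assumed transfer cost per entry

function canEverSync(entriesBehind: number): boolean {
  const passDuration = entriesBehind * MS_PER_ENTRY;
  for (let attempt = 0; attempt < MAX_RETRIES; attempt++) {
    if (passDuration <= SYNC_TIMEOUT_MS) return true;
  }
  return false;
}

console.log(canEverSync(1_000)); // small backlog fits inside the window
console.log(canEverSync(5_000)); // large backlog fails on every attempt
```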

private async sync(peerId: PeerId) {

To run my tests, I patched the sync code to increase the timeout from 15s to 500s.
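The issue doesn't include the patch itself; below is a minimal sketch of what a configurable sync deadline looks like. The `withTimeout` helper and the commented call site are illustrative, not GossipLog's actual code:

```typescript
// Illustrative only: GossipLog's real timeout lives inside its sync
// service; this helper just shows what raising the deadline means.
function withTimeout<T>(promise: Promise<T>, ms: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => reject(new Error('sync timed out')), ms);
    promise.then(
      (value) => { clearTimeout(timer); resolve(value); },
      (err) => { clearTimeout(timer); reject(err); },
    );
  });
}

// The patch amounts to passing 500_000 here instead of 15_000:
// await withTimeout(this.sync(peerId), 500_000);
```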

Here is the code I used to test:

// NOTE: import paths are assumed from the libraries in use; adjust to your versions.
import path from 'node:path';
import { createLibp2p } from 'libp2p';
import { createEd25519PeerId } from '@libp2p/peer-id-factory';
import { GossipLog } from '@canvas-js/gossiplog/node';

const TOPIC = 'test-topic';

const gossiplog1 = new GossipLog<ReplicatedObject>({
  directory: path.join(__dirname, 'gossiplog-1'),
  topic: TOPIC,
  apply: () => {},
});

const gossiplog2 = new GossipLog<ReplicatedObject>({
  directory: path.join(__dirname, 'gossiplog-2'),
  topic: TOPIC,
  apply: () => {},
});

const peerId1 = await createEd25519PeerId();
const peerId2 = await createEd25519PeerId();

const libp2p1 = await createLibp2p(
  LIBP2P_CONFIG(peerId1, getAddress(9000), [`${getAddress(9001)}/p2p/${peerId2.toString()}`], gossiplog1)
);

const libp2p2 = await createLibp2p(
  LIBP2P_CONFIG(peerId2, getAddress(9001), [`${getAddress(9000)}/p2p/${peerId1.toString()}`], gossiplog2)
);

console.log('Starting gossiplog service 1');
await libp2p1.start();

console.log('Appending 5_000 entries to gossiplog 1');
for (let i = 1; i <= 5_000; i++) {
  await gossiplog1.append(replicatedObject());
  if (i % 1000 === 0) {
    console.log('Appended', i, 'messages');
  }
}

console.log('Starting gossiplog service 2');
let i = 1;
libp2p2.services.gossiplog.addEventListener('message', event => {
  const syncedMessage = (event.detail as { message: Message<ReplicatedObject> }).message;
  const currentClock = syncedMessage.clock;
  if (currentClock === 2_500) {
    const object = replicatedObject();
    libp2p2.services.gossiplog.append(object).then(({ message }) => {
      console.log('Appended message while syncing. Message clock is', message.clock);
    });
    for (let j = 0; j < 500; j++) {
      libp2p2.services.gossiplog.append(object).then(({ message }) => {
        console.log('Appended message while syncing. Message clock is', message.clock);
      });
    }
  }
  if (i % 500 === 0) {
    console.log('Synced', i, 'messages');
  }
  i++;
});
await libp2p2.start();

Also, with the increased timeout from my patch and the message id changes, I was able to sync what seems to be an unlimited number of entries, so thanks for that fix.

We've actually refactored our internals so that syncing doesn't need a blocking transaction anymore! There's still a TIMEOUT error thrown, but only when one peer takes more than three seconds to respond to an individual RPC request, and this timeout is reset on every RPC response, so there is no limit on how long syncs can take now.
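The behavior described here, a fixed per-RPC inactivity window that resets on every response, can be sketched like this (an assumed implementation for illustration, not GossipLog's actual code):

```typescript
// Sketch of an inactivity deadline: fires only if no RPC response
// arrives within `ms`; calling reset() on each response defers it,
// so a sync can run indefinitely as long as the peer keeps replying.
class InactivityTimer {
  private timer?: ReturnType<typeof setTimeout>;

  constructor(
    private readonly ms: number,
    private readonly onTimeout: () => void,
  ) {}

  start(): void {
    this.reset();
  }

  // Call this on every RPC response to push the deadline back.
  reset(): void {
    clearTimeout(this.timer);
    this.timer = setTimeout(this.onTimeout, this.ms);
  }

  stop(): void {
    clearTimeout(this.timer);
  }
}
```

A sync loop would call `start()` when the sync begins, `reset()` on each RPC response, and `stop()` on completion; only a peer that goes silent for the full window triggers the timeout, so total sync duration is unbounded.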