waku-org/research

Sync store - How old messages can be requested for Sync

Opened this issue · 7 comments

In the Waku Store Sync protocol, a node that has been inactive/offline for some time can come back online and ask for the messages it missed during the offline period, a.k.a. the Sync mechanism. More on the foundational aspects of the Sync store here: #62

The following message sync policy questions should be answered:

  1. In general, a Waku Store node keeps the messages of the last 30 days and deletes anything older. But for a node that wants to Sync with its peers, how far back can it Sync messages?
  2. Should there be a time threshold for a message to be eligible to be part of a Sync request?
  3. Should we only have one type of Sync mechanism based on a single time threshold?

I have a different perspective here:

> 1. In general, a Waku Store node keeps the messages of the last 30 days and deletes anything older. But for a node that wants to Sync with its peers, how far back can it Sync messages?

This "general" case will not be common at all and should not be designed for. The fleets we have right now are not a good starting point when thinking about these concepts.

> 2. Should there be a time threshold for a message to be eligible to be part of a Sync request?

That should not be up to us. We only design the tools, not how to use them.

> 3. Should we only have one type of Sync mechanism based on a single time threshold?

Query to find the correct hashes, then ask for the messages from that hash list.
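The two-step flow described above (find which hashes are missing, then fetch the corresponding messages) can be sketched as follows. This is a toy illustration, not the actual Waku Store Sync interface; the function names and dict-based store are assumptions:

```python
# Hypothetical sketch of hash-based sync: first reconcile message hashes
# with a peer, then fetch only the messages we are missing.

def reconcile_hashes(local_hashes, peer_hashes):
    """Return the hashes the peer has that we do not, in sorted order."""
    return sorted(set(peer_hashes) - set(local_hashes))

def sync_with_peer(local_store, peer_store):
    """Two-step sync: diff the hash sets, then request message bodies by hash."""
    missing = reconcile_hashes(local_store.keys(), peer_store.keys())
    # Second round trip: ask the peer for the message payloads by hash.
    for h in missing:
        local_store[h] = peer_store[h]
    return missing

local = {"h1": b"msg1"}
peer = {"h1": b"msg1", "h2": b"msg2", "h3": b"msg3"}
print(sync_with_peer(local, peer))  # ['h2', 'h3'] — only the missing messages are fetched
```

The point of the first step is bandwidth: only hashes cross the wire until the node knows exactly which message bodies it needs.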

> I have a different perspective here:
>
> > 1. In general, a Waku Store node keeps the messages of the last 30 days and deletes anything older. But for a node that wants to Sync with its peers, how far back can it Sync messages?
>
> This "general" case will not be common at all and should not be designed for. The fleets we have right now are not a good starting point when thinking about these concepts.

That makes sense, thanks.

> > 2. Should there be a time threshold for a message to be eligible to be part of a Sync request?
>
> That should not be up to us. We only design the tools, not how to use them.

Yeah, totally. We need to make it configurable so that clients/apps can choose depending on their needs. But then we need to think about different solutions for different use cases; a Sync method might work for smaller data but not for bulk data.

> > 3. Should we only have one type of Sync mechanism based on a single time threshold?
>
> Query to find the correct hashes, then ask for the messages from that hash list.

Let me rephrase the question: based on how far back we aim to Sync messages, the implementation of such a use case may differ. Should we start thinking from that flexibility point of view?

> Let me rephrase the question: based on how far back we aim to Sync messages, the implementation of such a use case may differ. Should we start thinking from that flexibility point of view?

When you say older messages, what do you mean?

If you mean timestamp-wise, I don't think it matters. The protocol should not limit the range of queries. Implementations should limit requests and/or respond with multiple chunks of data, but that's a detail.
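Responding with multiple chunks of data could look like the following paginated sketch; the page size and cursor scheme here are assumptions for illustration, not the actual protocol:

```python
# Hypothetical sketch: a store answers a large query in fixed-size chunks,
# returning a cursor so the client can request the next page.

PAGE_SIZE = 2  # tiny, for illustration; a real limit would be much larger

def query_chunk(messages, cursor=0, page_size=PAGE_SIZE):
    """Return one chunk of results plus the cursor for the next chunk (or None)."""
    chunk = messages[cursor:cursor + page_size]
    next_cursor = cursor + page_size if cursor + page_size < len(messages) else None
    return chunk, next_cursor

def query_all(messages):
    """Client side: keep requesting chunks until the cursor is exhausted."""
    results, cursor = [], 0
    while cursor is not None:
        chunk, cursor = query_chunk(messages, cursor)
        results.extend(chunk)
    return results

store = ["m1", "m2", "m3", "m4", "m5"]
print(query_all(store))  # ['m1', 'm2', 'm3', 'm4', 'm5'], delivered two at a time
```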

If you mean older versions of messages, then I think we could support any version if we treat messages as data blobs. Only the indexes would differ. As long as we can hash a message deterministically, the number of indexes pointing to a message can vary by version.
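Treating messages as opaque blobs keyed by a deterministic hash, with separate indexes layered on top, could be sketched like this. It is a toy illustration; the SHA-256 scheme and index names are assumptions, not Waku's actual message hash definition:

```python
import hashlib

# Hypothetical sketch: messages are stored as opaque, version-agnostic blobs
# keyed by a deterministic hash; separate indexes map attributes to that hash.

def message_hash(blob: bytes) -> str:
    """Deterministic hash over the raw message bytes: same blob, same hash."""
    return hashlib.sha256(blob).hexdigest()

blobs = {}       # hash -> raw message blob (version-agnostic storage)
time_index = {}  # timestamp -> hash (one of possibly several indexes)

def store_message(blob: bytes, timestamp: int) -> str:
    h = message_hash(blob)
    blobs[h] = blob
    time_index[timestamp] = h
    return h

h = store_message(b"hello waku", timestamp=1700000000)
assert message_hash(b"hello waku") == h  # deterministic, so peers agree on hashes
```

Because the hash depends only on the blob, adding or removing an index (per message version, per content topic, etc.) never changes what the blob is called, so peers always agree on message identity.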

> a Sync method might work for smaller data but not for bulk data.

Why would it not work?

> > Let me rephrase the question: based on how far back we aim to Sync messages, the implementation of such a use case may differ. Should we start thinking from that flexibility point of view?
>
> When you say older messages, what do you mean?
>
> If you mean timestamp-wise, I don't think it matters. The protocol should not limit the range of queries. Implementations should limit requests and/or respond with multiple chunks of data, but that's a detail.

Yeah, this one. Got it!

> If you mean older versions of messages, then I think we could support any version if we treat messages as data blobs. Only the indexes would differ. As long as we can hash a message deterministically, the number of indexes pointing to a message can vary by version.
>
> > a Sync method might work for smaller data but not for bulk data.
>
> Why would it not work?

I mean it would work, but perhaps not optimally, so we need to consider the tradeoffs. If we build a Prolly tree on top of all the message hashes in the DB, it could become huge. For example, if 90% of Sync requests are for the last 30 minutes of data, why build a Prolly tree over all the data?

> If we build a Prolly tree on top of all the message hashes in the DB, it could become huge.

Yes, the trees would be as big as the number of messages in the DB.

> If 90% of Sync requests are for the last 30 minutes of data, why build a Prolly tree over all the data?

Otherwise, how would you search for a specific message?

Prolly trees are very efficient for random reads and writes.
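For context, the core idea of a Prolly (probabilistic) tree is content-defined chunking of a sorted key set: chunk boundaries are derived deterministically from the keys themselves, so two nodes holding the same data build identical trees and can diff them cheaply. A minimal single-level sketch, with an illustrative boundary probability (the real tree recurses this chunking over multiple levels):

```python
import hashlib

# Minimal sketch of Prolly-tree-style content-defined chunking: a key ends a
# chunk when its hash falls below a threshold, so the chunking depends only
# on the data, not on insertion order or which peer built the tree.

BOUNDARY_PROB = 4  # on average, 1 in 4 keys ends a chunk (toy value)

def is_boundary(key: str) -> bool:
    digest = hashlib.sha256(key.encode()).digest()
    return digest[0] % BOUNDARY_PROB == 0

def chunk_keys(sorted_keys):
    """Split a sorted key list into chunks at content-defined boundaries."""
    chunks, current = [], []
    for key in sorted_keys:
        current.append(key)
        if is_boundary(key):
            chunks.append(current)
            current = []
    if current:
        chunks.append(current)
    return chunks

keys = sorted(f"hash{i}" for i in range(10))
assert chunk_keys(keys) == chunk_keys(keys)  # deterministic across peers
assert [k for c in chunk_keys(keys) for k in c] == keys  # no keys lost
```

Because boundaries are local, inserting one key only reshapes the chunks near it, which is what keeps random writes cheap and diffs between two peers' trees small.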

> > If 90% of Sync requests are for the last 30 minutes of data, why build a Prolly tree over all the data?
>
> Otherwise, how would you search for a specific message?

How about having two trees: one with, say, the past 1 hour of activity and the other with the rest? This way we can serve hot data faster, which is more prone to being missed. We could also define priorities based on that, since hot data is what a real-time messaging use case will be most interested in, for instance.

> How about having two trees: one with, say, the past 1 hour of activity and the other with the rest? This way we can serve hot data faster, which is more prone to being missed. We could also define priorities based on that, since hot data is what a real-time messaging use case will be most interested in, for instance.

Prolly trees are ordered, so there is no need to split them. You would just iterate in reverse to find the latest messages.

The time index would be a tree with timestamps as keys and hashes as values.
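A time index like that, with reverse iteration to reach the hottest (most recent) entries first, might look as follows. A plain sorted list stands in for the ordered tree, and all names are illustrative:

```python
import bisect

# Sketch of the time index: an ordered structure with timestamps as keys and
# message hashes as values. A sorted list of tuples stands in for the tree.

entries = []  # list of (timestamp, message_hash), kept sorted by timestamp

def insert(timestamp: int, message_hash: str):
    bisect.insort(entries, (timestamp, message_hash))

def latest(n: int):
    """Iterate in reverse to fetch the n most recent message hashes."""
    return [h for _, h in reversed(entries[-n:])]

def time_range(start: int, end: int):
    """All hashes with start <= timestamp < end (a Sync window query)."""
    lo = bisect.bisect_left(entries, (start,))
    hi = bisect.bisect_left(entries, (end,))
    return [h for _, h in entries[lo:hi]]

for ts, h in [(100, "h1"), (300, "h3"), (200, "h2")]:
    insert(ts, h)

print(latest(2))             # ['h3', 'h2'] — hot data, newest first
print(time_range(100, 250))  # ['h1', 'h2'] — an arbitrary Sync window
```

Since the structure is ordered by timestamp, serving recent "hot" data and serving an arbitrary historical window are the same operation, which is the argument against maintaining two separate trees.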