waku-org/research

Cost function of message retrieval

Opened this issue · 5 comments

What should determine the cost of retrieving a message?

PoC approaches

The server may estimate the cost as:

  • constant per message
  • constant per byte
  • dependent on message age (presumably, newer messages are cheaper)
  • dependent on other message parameters (topic, metadata, etc)

More elaborate cost calculations

The costs of a Store server are storage, bandwidth, and computation.
A Store server does two things: it stores messages, and serves messages to clients.

The cost of storing messages is composed of:

  • storage:
    • storage costs of all older messages: proportional to cumulative (message size x time stored);
    • the cost for I/O operations for storing new messages (roughly constant per unit time, though may fluctuate due to caching, disk fragmentation, etc.);
  • bandwidth (download) for receiving new messages;
  • computational costs.

The cost of serving messages to clients, per unit time, is composed of:

  • bandwidth
    • upload: proportional to (number of clients) x (length of time frame requested) x (message size);
    • download: proportional to the number of requests;
  • computational cost of handling requests.

Storage is likely the dominating cost.
Storage costs is proportional to the amount of information stored and the time it is stored for.
A cumulative cost of storing a single message grows linearly with time.
The number of messages in a response may be approximated by the length of the time frame requested.

Computation: the server spends computing cycles while handling requests.
This costs likely depends not only on the computation itself, but also at the database structure.
For example, retrieving old or rarely requested messages from the local database may be more expensive than fresh or popular ones due to caching.

More formal calculations should be done, under certain assumptions about message flow (i.e., that it is constant).

With the new store req-res that returns message ids, then it becomes trivial to have a cost per message.
This assumes that client does not pay for message ids, but pays for message content:

sequenceDiagram
	client-->>server: request msgs ids for period t1 to t2
	server-->>client: returns 10 msg ids
	client->>client: check seen msgs
    client->>client: 3 msgs are missing
	client->>blockchain: pays server for 3 msgs
	client->>server: request 3 messages, attaches txid
	server->>blockchain: checks payment
	server-->>client: deliver messages
Loading

PS: sorry about the label deletion, this PR will remove this kind of automtaed action: waku-org/pm#103

client does not pay for message ids, but pays for message content

The scheme makes a ton of sense!

Do you think we should aim for it in the PoC, or in later protocol iterations? As of the current PoC outline, the client skips the request-ids stage and instead pays a fixed price per hour of history, and gets all relevant messages in response, no matter if the client already has them or not. This is kinda wasteful (client gets messages it already has) but conceptually simpler (just one round-trip).

Another thing that I like about this scheme is that the client can also use the list of returned hashes:

  • to compare responses by different store nodes and choosing the one with the most up-to-date response
  • verifying that a store node does indeed store at least some valid messages by comparing to known hashes in its own message set
  • to compare responses by different store nodes and choosing the one with the most up-to-date response

Absolutely, I think I wrote somewhere how Alice could then do a sync req to Bob and Carole.
Compare what messages they have, query/pays Bob because he's cheapest to retrieve the bulk of messages, and then query/pays Carole of a unique message she has, but she is more expensive.

It may also make the market more interesting:

  • Bob is cheaper because he only store 12 hours of messages.
  • Carole is more expensive because she stores a week of message.

Alice will always use Bob for the most recent messages, but has to use Carole if she was offline for more than 12 hours.

What is interesting here, is that once the protocol is available, this is purely an engineering problem of window-shopping and cherry-picking.

client does not pay for message ids, but pays for message content

The scheme makes a ton of sense!

Do you think we should aim for it in the PoC, or in later protocol iterations? As of the current PoC outline, the client skips the request-ids stage and instead pays a fixed price per hour of history, and gets all relevant messages in response, no matter if the client already has them or not. This is kinda wasteful (client gets messages it already has) but conceptually simpler (just one round-trip).

I will let you and @jm-clius decide here. I think it may be worth considering having it from PoC, if store sync can be delivered before.

My current opinion is that when store sync is available, then we can assume the best practice will become:
(A)

  • store sync
  • store query missing messages using id.

instead of
(B)

  • store query using using time frame or cursor and get full messages.

My expectations might be incorrect tho. My assumption is that (A) gives you a better bandwidth/latency ratio.

store sync:

  • Quick DB query as I'd assume the msg id are stored in a separate table and hence the IO is minimal for a larger amount of messages
  • quick over the wire as again, data is small
  • No or limited pagination (msg id per page can be higher than msg per page in a normal query)

query with msg id:

  • Quick DB query as primary key can be used
  • quick over the wire, as only useful messages are transmitted.
  • No or limited pagination

Even if we always 2 rounds trips, the benefit over (B) should be visible as soon as (B) returns more than 1 page of results.

Note: when I write "store sync" I mean "store query to retrieve messages id". This may not be the correct/final name of the protocol.

My current opinion is that when store sync is available, then we can assume the best practice will become:
(A)
store sync
store query missing messages using id.

My tangential concern here is about the relationship between Store sync and incentivized Store requests. If both are available, why would a rational client use incentivized requests? It can "pretend" to be a syncing server and get the same data for free. I know there may be heuristics to distinguish such clients-pretending-to-be-servers, but I'm not sure we can do it with certainty.