SEMICeu/LinkedDataEventStreams

Delayed flush strategy profile for views


As long as a fragment is still being written to, the client needs to re-parse the entire fragment on every poll, including the members it has already processed.
In the worst case, when the client's polling interval is <= the member addition interval, the number of effectively transferred and parsed members is given by max_fragment_size * (max_fragment_size + 1) / 2, where max_fragment_size is the maximum number of members allowed in a fragment.
Say 5 members are allowed per page and the client polls faster than members are written to the fragment (with ETags used to avoid needless processing), then the number of effectively transferred and parsed members is 1 + 2 + 3 + 4 + 5 = 15.
If max_fragment_size is set to 250 members, 31375 members are transferred and parsed to process 250 members.
At this point the efficiency of both data transfer and parsing compute has dropped to 0.80% = (max_fragment_size / (max_fragment_size * (max_fragment_size + 1) / 2)) * 100, leaving 99.2% of resources spent on algorithmic overhead.
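To make the arithmetic concrete, here is a small sketch (plain Python, not part of any LDES implementation) that computes the worst-case transfer cost and the resulting efficiency for a given max_fragment_size:

```python
def worst_case_cost(max_fragment_size: int) -> int:
    """Members transferred when the client re-fetches the fragment
    after every single member addition: 1 + 2 + ... + n."""
    return max_fragment_size * (max_fragment_size + 1) // 2

def efficiency_percent(max_fragment_size: int) -> float:
    """Useful members divided by members actually transferred and parsed."""
    return max_fragment_size / worst_case_cost(max_fragment_size) * 100

print(worst_case_cost(5))                  # 15
print(worst_case_cost(250))                # 31375
print(round(efficiency_percent(250), 2))   # 0.8
```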

These 31375 members have to be transferred over the web and parsed, so a lot of bandwidth and CPU cycles are wasted.
In order to mitigate this, I would propose introducing some semantics in LDES so that a server can indicate to a client that it follows a 'delayed flush strategy'.
So how would this work? A server would only write out fragments that are immutable. It can do this by buffering writes for a maximum time (let's go with 10 seconds for the sake of argument). After this time, or when the max fragment size is reached, all buffered members are flushed as an immutable fragment. This results in potentially smaller fragments being written, which is fine for those reading at the end of the log.
As a client knows (from the view definition) that fragments are written out in an atomic fashion, it only needs to request each fragment once.
If no relations to a newer fragment are found, the client can fall back to polling with HEAD requests, provided the relations are also exposed in the HTTP headers.
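The server-side buffering described above can be sketched as follows; this is a minimal illustration under stated assumptions, where `flush_fragment` is a hypothetical callback that persists an immutable fragment (nothing here is prescribed by LDES):

```python
import time

FLUSH_INTERVAL = 10.0    # seconds; the buffer window from the example above
MAX_FRAGMENT_SIZE = 250  # cap on members per fragment

class DelayedFlushBuffer:
    """Buffers incoming members and flushes them as one immutable fragment
    when either the time window elapses or the size cap is reached."""

    def __init__(self, flush_fragment):
        # flush_fragment: hypothetical callback that writes out an
        # immutable fragment (relations, caching headers, etc.)
        self.flush_fragment = flush_fragment
        self.buffer = []
        self.window_start = time.monotonic()

    def add(self, member):
        self.buffer.append(member)
        if len(self.buffer) >= MAX_FRAGMENT_SIZE:
            self.flush()

    def tick(self):
        # Called periodically by the server's event loop.
        if self.buffer and time.monotonic() - self.window_start >= FLUSH_INTERVAL:
            self.flush()

    def flush(self):
        # From this point on, the written fragment never changes,
        # so a client needs to fetch it only once.
        self.flush_fragment(list(self.buffer))
        self.buffer.clear()
        self.window_start = time.monotonic()
```

Because nothing is served until `flush` runs, every fragment a client sees is already complete, which is what makes the single-fetch behaviour safe.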

Of course, this means that data can be delayed by up to the buffer window, but that trade-off would be reasonable to make.
The profile could be announced on the view with a simple statement.
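For illustration only, such an announcement could look like the Turtle snippet below; the `ldes:flushStrategy` predicate and `ldes:DelayedFlush` term are hypothetical names used as placeholders, not part of the current vocabulary:

```turtle
@prefix ldes: <https://w3id.org/ldes#> .
@prefix tree: <https://w3id.org/tree#> .

<https://example.org/feed> a ldes:EventStream ;
    tree:view <https://example.org/feed/by-time> .

# Hypothetical: announce on the view that fragments are flushed atomically
<https://example.org/feed/by-time> ldes:flushStrategy ldes:DelayedFlush .
```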