Latest processed item as a marker
Closed this issue · 5 comments
There is a note suggesting to keep a list of already processed items. Would it be a recommendable alternative in some situations to instead keep the latest processed item as a marker of how far the stream has been processed?
This is actually a discussion we are already having as part of the LDES client at https://github.com/TREEcg/event-stream-client
We want an approach that works for any fragmentation, even if the fragmentation on the server changed, we want to make sure the bookmark still works. A bookmark then becomes a SPARQL query, for example, this could be a bookmark based on prov:generatedAtTime
only:
SELECT * WHERE {
<lastProcessedItem> prov:generatedAtTime ?bookmarkTime .
<Collection> tree:member ?member .
?member prov:generatedAtTime ?time .
FILTER (?time > ?bookmarkTime)
}
We plan on adding this functionality in the LDES client.
FWTW, I'd recommend against time but rather suggest using a sequence number. Time isn't accurate enough in distributed systems and you can't rely on it for order, therefore you can't in pagination either. Unless you actually deal with predictable time ranges like TrueTime. A sequence number (assigned by a single writer, basically) would therefore be best in your example.
@sroze In the LDES spec you can model all of these! We don’t restrict data publishers on a specific solution.
Regarding timestamps vs. sequence numbers: for resuming syncing with an LDES, we don’t use timestamps nor sequence numbers. The main driver behind pauzing and resuming will be HTTP cache semantics (checking whether you already processed a page based on the HTTP response), and when refetching a page when a TTL passed, then you might find objects that you already processed, which you may of course filter out based on an LRU cache. @sroze A bit like your idempotence constraint in your presentation: https://www2.slideshare.net/samuelroze/event-streaming-symfony-world-2020
So, while we leave it entirely open, I still wanted to comment that the timestamps can be used for pagination if they are set by from the publishing server of course (centralized). In the example given ↑ they are processed by a client as if they are sequence numbers, as we don’t check the time on the client. The query wouldn’t work however when there are a lot of events on the same timestamp, then we certainly also need to check on another property.
Sequence numbers would make it viable for applications to guarantee that each event is processed exactly once though, which will be important for certain projections. Things will slip through a LRU cache eventually, and persisting a single sequence number to disk should be a lot easier to persistent on disk than an entire cache's state. There's also the possibility of the data publisher changing the fragmentation somewhere along the way, where caching headers won't be able to help at all.
Moved to #31