Some ideas about streams...
sandervd opened this issue · 1 comments
First of all, I'm really new to the spec, so I'm sure I'll make some wrong assumptions here...
But I have leaned that the easiest way to learn something on the internet is to make a statement, and someone will try to prove you wrong.
So here it goes:
I under the impression that the design of the event stream was focused on being bulk-loadable in a triplestore as-is, in such a way that the data can be queried without further processing; the state is present in the stream itself.
Although I think this is a really interesting feature, its biggest disadvantage is that the semantics of the event metadata and the entity data get mixed.
I'll try to argue here for an alternative approach, where the event stream is seen as more as transaction log.
So this is how I understood the spec; what I thought to be the requirements that where behind the definition of spec:
- Events are appended to the end of the stream
- Events are not business events, but the representation of the full state of an entity.
(I assume this because of the design where the stream is queryable in and by itselve, more a growing graph.) - Entities can be updated
- Old version, that no longer contribute to the state of the system can be cleaned up, if the stream decides to do so.
So now for what I see to be the disadvantages to the current LDES approach:
- The event data and metadata are mixed. Because of this, the stream semantics might need to be updated when the datamodel of the producer evolves.
There is a risk of having semantic collision between carrier and message, if for instance a DCAT Datasat would be carried over a stream, the use of isVersionOf would conflict. This is worked around with by the ldes:timestampPath and ldes:versionOfPath, but gives room to errors. - No order to the events can lead to inconsistency in the materialised state derived from the stream. Using time for ordering in a distributed system will lead to errors at some point.
- Deletion of objects is implicit, through removing members from the stream.
- Lack of transaction semantics. Since the state of objects is carried over, it might be that multiple objects need to be changed at once in order to remain adherent to the defined shapes.
(This might not be relevant if the stream consist of sensor data, where each entity is an instance of the same class, but if the source is a business application the need will arise eventually). - Due to the approach to versioning, the graph of all events can't be used directly, but a subgraph needs to be derived by the indirection through the tree:member property of the event stream instance.
This seems to imply that for each update two pages: the first containing the stream definition and the last one with the latest event.
In case the stream grows to terrabytes of data, the reprocessing of the (huge) EventStream instance could become an issue. - Another issue I see with the versioning is the mix between application versioning (an entity received updates through time, and I want to see previous versions of it), and infrastructure concerns (I want to be able to do point-in-time restores if the application update messes up the database).
So, after this, I would like to propose another event representation that tries to address these issues(Example 1 of spec):
ex:C1 a ldes:EventStream ;
tree:shape ex:shape1.shacl .
[
a ldes:Event ; # Could be implied if pages are dedicated?
ldes:sequence "1"^xs:integer ;
ldes:commitTime "2021-01-01T00:00:00Z"^^xsd:dateTime ; # Used as metadata in case a point-in-time recovery of the stream needs to be performed, and the recovery point needs to be determined).
ldes:key <ex:Observation1> ; # The subject of the object represented.
ldes:value [ # value is optional, if not present this is a tombstone event (delete).
a sosa:Observation ;
sosa:resultTime "2021-01-01T00:00:00Z"^^xsd:dateTime ;
sosa:hasSimpleResult "..." .
]
]
This also allows for representing the stream in a key-value ledger model as used by Apache Kafka.
The semantics for materialising this stream in a triplestore would be simple.
For each event in stream:
- Delete all triples in DB with the event key as subject (ignoring nested blank nodes for a moment).
- Take the ldes:value of the event, and set the subject of the triples to the event:key. Insert into DB
Under this model, versioning of objects can be done at the application level using the isVersionOf approach if the versions concerns the application (e.g. multiple versions are kept so users can roll back to a previous version of an entity), as well by the log retention policy algorith that is chosen on the stream level, which only concern is to be able to move back in time (for instance for point in time recovery), and doesn't need to concern about application versioning.
A retention policy could keep the full state (decided by the producer what should make up the state), as well as the transaction log for a certain time.
(following the idea of how topic log compaction is done in Apache Kafka)
Additional properties could be used to signal transaction boundaries (if retention policies apply, the boundaries can only be used when reading the head of the log, where messages are sequential).
e.g. ldes:trx_id "1"
If the stream delivery method can't guarantee that events are committed to the stream representation in an atomic way, a transaction event count could be added to the first event part of a transaction, to indicate the number of events that should be read as part of the transaction.
Seems like you propose a kind of reification model to point at an event. You can still create a data model, for example a Sander’s Data Stream vocabulary that defines sds:Event
and sds:sequence
and sds:key
as you propose. We can then apply just default LDES techniques, on members that are shaped according to the sds:Event
.
There are multiple ways of organizing streams so I’d like to keep this to application-level. I want to limit LDES to the concept of a container of immutable objects, as multiple organizational system can be built on top of that. It is true however that we do need descriptions on top of an LDES or a ViewDescription to give hints to an application about how a stream is managed. For example, that’s why tree:shape
, tree:timestampPath
, tree:versionKey
, etc. on top of ldes:EventStream exist. This makes it more clear to LDES clients what should happen in a back-end when the stream is being processed.
Other solutions could entail:
- Using something like an sds:Member from our Semantically Describing Streams vocab
- Using a CRDT vocabulary solid/vocab#69
- Using the Activity Streams event notifications: https://www.eventnotifications.net/
- ...
In a Slack thread you specifically indicated problems with tombstone objects. I think we could add however a specific hint in the LDES description to the type that will be used for tombstone objects, so we can still use different vocabularies within LDES to manage the stream.