Consistent graph replication - RDF Dataset Canonicalization
Opened this issue · 4 comments
When a client requires hard guarantees on consistency, the logic described in the RDF Dataset Canonicalization specification could be used to provide hashes of the state that should be reached after applying a fragment or, even better, a transaction.
This becomes relevant in cases where LDES is used as a replication protocol for named graphs (the client should have an exact copy of the named graph the publisher intended). For instance, consistency could be lost if a client is offline longer than allowed by the retention period, which could result in missed delete operations (tombstone events). If a checksum mismatch is detected, the client must restart replication from the start of the log to arrive at a consistent state.
Reference: https://www.w3.org/TR/rdf-canon/
I think this can be applied generically to TREE (tree client)?
Hmm, I was thinking more of including a hash on each member (version object) that would represent the state of the full represented graph after applying the change:
For instance, if we had a collection {(1, A, State 1), (2, B, Some value), (3, A, State 2)}
After applying the 3rd member, we would have the graph:
{(A: State 2), (B: Some value)}. The hash should in this case be the hash of the state of the full graph, if that makes sense 😄
This way we can give much stronger guarantees of consistency.
Of course, the hashes would only be valid in the tail of the log, because retention deletes objects that have a newer state further along in the log.
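To make the example collection above concrete, here is a minimal Python sketch that replays the three members and computes a hash of the full state after each one. It is not an implementation of RDF Dataset Canonicalization: sorted JSON over a toy key/value state stands in for a canonical serialization of the graph, and the tuple layout and the names `state_hash` and `replay` are assumptions made for illustration only.

```python
import hashlib
import json

# Hypothetical member layout: (sequence number, entity id, state).
members = [
    (1, "A", "State 1"),
    (2, "B", "Some value"),
    (3, "A", "State 2"),
]


def state_hash(graph):
    """Hash the full materialized state, independent of insertion order."""
    # Stand-in for RDF Dataset Canonicalization: sorting the keys gives a
    # deterministic serialization for this toy key/value state only.
    canonical = json.dumps(graph, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def replay(members):
    """Apply members in order and yield (sequence, state, hash of state)."""
    graph = {}
    for seq, entity, state in members:
        graph[entity] = state  # a newer version replaces the older one
        yield seq, dict(graph), state_hash(graph)


for seq, graph, digest in replay(members):
    # Print a shortened digest next to the state it describes.
    print(seq, graph, digest[:12])
```

After the 3rd member the state is {A: State 2, B: Some value}, and the hash attached to that member covers that whole state rather than just the changed entity.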
I actually use that over here, to transform data dumps into an LDES feed: https://github.com/pietercolpaert/DCAT-AP-Dumps-To-Feeds/blob/main/index.ts#L59
I’m not sure, however, what the influence on the LDES spec itself would be. Do you expect this hash to be present in the member? Do you want a path to point to that property?
Yes, I would see it as metadata of an event, similar to its timestamp. The hash would indicate the state of the graph after applying the member (or members, in the case of a transaction). This way we can assure graph integrity over time: the client can validate that it holds an exact replica of the graph the publisher intended.
I see this as an important guarantee in cases such as the base registries.
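A rough sketch of how a client could use such a hash, assuming it travels with each member as metadata. The field names (`entity`, `state`, `deleted`, `state_hash`), the `ReplicaOutOfSync` exception, and the `publish`/`replicate` helpers are hypothetical and not part of LDES or TREE; sorted JSON again stands in for a canonicalized RDF dataset.

```python
import hashlib
import json


class ReplicaOutOfSync(Exception):
    """Raised when the local state does not match the published hash."""


def state_hash(graph):
    # Toy stand-in for hashing a canonicalized RDF dataset.
    canonical = json.dumps(graph, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def apply_member(graph, member):
    if member["deleted"]:
        graph.pop(member["entity"], None)  # tombstone event
    else:
        graph[member["entity"]] = member["state"]


def publish(changes):
    """Publisher side: attach the expected state hash to every member."""
    graph, log = {}, []
    for change in changes:
        apply_member(graph, change)
        log.append({**change, "state_hash": state_hash(graph)})
    return log


def replicate(log, graph=None):
    """Client side: apply members and verify the state after each one."""
    graph = {} if graph is None else graph
    for member in log:
        apply_member(graph, member)
        if state_hash(graph) != member["state_hash"]:
            raise ReplicaOutOfSync(member["entity"])
    return graph


log = publish([
    {"entity": "A", "state": "State 1", "deleted": False},
    {"entity": "B", "state": "Some value", "deleted": False},
    {"entity": "B", "state": None, "deleted": True},
    {"entity": "A", "state": "State 2", "deleted": False},
])

# The client went offline after member 2; retention has since removed
# everything but the last member, so the tombstone for B was never seen.
retained_log = log[3:]
stale_replica = {"A": "State 1", "B": "Some value"}

try:
    replica = replicate(retained_log, dict(stale_replica))
except ReplicaOutOfSync:
    # Mismatch detected: discard local state and restart from the log start.
    replica = replicate(retained_log)

print(replica)
```

In this scenario the stale replica still contains B, so its hash differs from the one published with the last member, and the client falls back to a full re-replication.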