Add checksum and other metadata to document prefix
albe opened this issue · 5 comments
A checksum (crc32) in the document should help against corruption and torn writes. (see #31 (comment)) An additional field for document sequence number and commit position would allow 2pc protocols for potential cross-partition transaction with the partition storage. See #71
Making the sequence number writer choosable would allow for consistent reads over multiple partitions even without an index, and hence allow rewriting indexes (see #24). Using a vector clock would allow concurrent writes with at least a causal (partial) ordering. [Edit: vector clocks and version vectors are unfit, because they grow in size depending on cluster size]
Possible document header structure:
[int32] document size
[int32] global sequence number (for consistent reads across partitions, rebuilding index)
[int32] commit position (for 2pc protocol)
[int32/64] timestamp (relative to partition start timestamp - needs partition metadata as in #33)
[int32/64] timestamp2 (upper bound for TrueTime-like "external constistency")
[int32] crc (for error detection, verifiability)
Optimal would be a block of 16 bytes, or a multiple (see #71).
Possible solutions:
- keep crc inside the document, depending on serializer to detect invalid documents, e.g. MsgPack
- use offset as sequence number, like Kafka does and use the int for a second timestamp - makes optimistic concurrency checks harder
- leave out commit position, which prevents 2pc from being implemented on top of the storage directly
- make the prefix larger (32 bytes)
Regarding timestamp resolution:
- in order to support 50k writes/s, the resolution for the monotonic clock needs to be 1/50000s = 0,00002s = 20µs
- at 20µs resolution, a 32bit timestamp can only store a bit over 23 hours, which is definitely too low
- so the high-res timestamp needs to be either stored as 64bit number (javascript internal double float) or a 64bit bigint
- see https://nodejs.org/api/process.html#process_process_hrtime_time and https://nodejs.org/api/perf_hooks.html#perf_hooks_performance_now - note that hrtime seems to be more performant, it has an inaccuracy of ~1.5µs, while performance.now is at ~5µs
const { performance } = require('perf_hooks');
let maxPerf = 0, maxHrtime = 0;
//performance.now(); // uncomment this line and performance.now will always be orders of magnitude slower. Seems the first call initializes stuff and takes ~1ms
for (let i = 0; i < 1000; i++) {
// performance.now is milliseconds
maxPerf = Math.max(maxPerf, -(performance.now() - performance.now()) * 1000);
// process.hrtime is nanoseconds
maxHrtime = Math.max(maxHrtime, -parseInt(process.hrtime.bigint() - process.hrtime.bigint()) / 1000);
}
console.log('performance.now: ', maxPerf, ' - process.hrtime: ', maxHrtime);
hence process.hrtime
should be preferred.
So if TrueTime like external consistency is to be used, more than 16bytes are required for the document header. Otherwise, there is only room for either an external global sequence number or a commit position. In the latter case, this could only be the file offset.
Some benchmarking and testing in how to best write/read a 64bit timestamp to/from a buffer.
Findings:
- use
Number(bigint)
instead ofparseInt(bigint, 10)
to convert to number - use
0x100000000n
as a factor to split the bigint into higher/lower 32bit int
const scale = 0x100000000n;
buffer.writeUInt32BE(Number(time / scale), 0);
buffer.writeUInt32BE(Number(time % scale), 4);
var high = BigInt(buffer.readUInt32BE(0));
var low = BigInt(buffer.readUInt32BE(4));
var time2 = high * scale + low;
But, process.hrtime.bigint()
is nodejs >= v10.7.0 only - so this will not work with nodejs 8, which is currently still supported. Also BigInt is >= 10.4.0, so this would need to be handled via a double precision float (javascript number). It's able to hold a timespan of 285,6 years at 1µs resolution, so safe at least until 2255.
Since ECMA Script timestamps ignore leap seconds (https://tc39.es/ecma262/#sec-time-values-and-time-range), it is safe to choose the Date.now()
time of the partition as the epoch and store it e.g. in the partition metadata. Afterwards all timestamps of a document can be stored relative to that epoch.
Since hrtime has an arbitrary epoch X
, we need to calculate each timestamp of a document relative to the unix timestamp of that epoch. We can calculate that e.g. at the start of the application by measuring hrtime since X
and subtracting that from the current Date.now()
. Since Date.now()
is only milliseconds resolution, this can be off up to millisecond (plus the time of the method call) - but since the startup time takes much longer, no new write will happen earler than that inaccuracy, so monotonicy is still guaranteed.
// The epoch of the partition read from metadata, e.g.
const epoch = (new Date('2019-11-02T20:00:00')).getTime() * 1000.0;
// Epoch of current process execution
const timeStart = process.hrtime();
const currentEpoch = Date.now() * 1000.0;
function time() {
const delta = process.hrtime(timeStart);
return (currentEpoch - epoch) + (delta[0] * US_PER_SEC + delta[1] / 1000);
}
Since time()
is monotonic within one execution time, system time shifts will only affect things if the process is started exactly during a "leap second". Considering this unlikely to the point of not being worth the effort to further work around this edge case (would need to check if current time is a leap second, which is non-trivial).
Still, there is the issue of the system clock being set back between two invocations, i.e. currentEpoch < lastDocumentTimestamp
. This case can be detected and an exception thrown.
Now considering some form of "external consistency" ala Spanner, we need to store the accuracy of the timestamp, so that during a write, we can wait at minimum the inaccuracy of time, before actually inserting a document A
. This guarantees that if any other process would generate a document B
with a timestamp, that timestamp is after the timestamp of A
, i.e. trade-off latency vs. consistency.
That accuracy depends on following factors:
- within a single process it can be assumed to be ~0 (or the time needed for the method invocation to get the time which is ~1.5/2µs as measured above)
- across processes, it's the time required to send a message from one process to another (should on the order of hundreds of µs)
- across machines, it's the accuracy of time synchronization (unless you have access to spanners TrueTime or another high precision synchronized clock, roughly a couple ms)
Though I'm unsure in the last case how far the replication lag also plays a role, because if machineX
generates a document earlier thanY
, but needs longer than the accuracy to send it to the coordinator, the coordinator might receive and hence writeY
beforeX
.
Since the document timestamp is stored relative to the partition epoch, it has relatively low magnitude and hence precision is free to store the timestamp accuracy e.g. in the decimal part of the timestamp.
Addendum:
The time()
method should be injectable/configurable
Can't use partition file creation time as epoch for a monotonic clock, because *NIX filesystems don't consistently store a file creation time. See https://stackoverflow.com/questions/17552017/get-file-created-date-in-node
Hence epoch will be stored in partition metadata and a clock can then measure monotonic time, which will work as a cross-partition ordering field.
As of #80 the document header now contains an external sequence number as well as a (single) monotonic timestamp relative to the partition epoch which is stored in the partition metadata.
Note: Spanner TrueTime uses two timestamps to make the uncertainty of the timestamp explicit. It basically encodes a range of valid times the data was written at. In most cases, this uncertainty is very low in relation to the actual timestamp, e.g. a few ms for timestamps that may grow as old as a couple of years. Hence, it suffices to encode the uncertainty in the decimal part of a double precision float that denotes the amount of µs
since the epoch. Currently, it is encoded as ms uncertainty / 1000
. So a timestamp of 1050.001
denotes a time that is 1050µs +- 1ms after the epoch, i.e. [50µs, 2050µs]
. A potential coordinator could hence use those timestamps to create a consistent ordering of the writes across multiple processes/machines, by waiting 2000µs + RTT for another write coming in, that possibly has a smaller timestamp, before writing and hence committing.