Proposal: gzip partition data
whmeitzler opened this issue · 0 comments
Given a naive write of sequential data:
for i := int64(1); i < 10000; i++ {
	_ = storage.InsertRows([]tstorage.Row{{
		Metric:    "metric1",
		DataPoint: tstorage.DataPoint{Timestamp: 1600000000 + i, Value: float64(i)},
	}})
}
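(For context, this assumes a disk-backed storage handle opened roughly as follows; the option names are taken from the tstorage README and may differ by version:)

storage, err := tstorage.NewStorage(
	tstorage.WithDataPath("./data"),
)
if err != nil {
	panic(err)
}
defer storage.Close()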
The resulting data file contains long stretches of repeated bytes (mostly 0x00s). This is a great fit for gzip, a byte-stream de-duplicator.
If I gzip the file:
$ gzip -k data && ls -alsh data*
84K -rw-rw-r-- 1 wmeitzler wmeitzler 82K Sep 20 09:45 data
4.0K -rw-rw-r-- 1 wmeitzler wmeitzler 1.3K Sep 20 09:45 data.gz
That's roughly 21x compression (84K down to 4.0K on disk)!
Note that this only really provides value when long stretches of adjacent data points are similar. If I populate a file with truly random values rather than ascending ones, I can only compress it from 82K to 80K, and I'm still paying CPU time to produce these barely smaller files. I suspect, but have not validated, that a meaningful share of TSDB use cases generate the kind of adjacent-similarity series that would benefit from gzip compression.
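For anyone who wants to reproduce the comparison without touching the partition format, here is a minimal standalone sketch that gzips 10,000 raw float64 values generated both ways. The absolute sizes won't match the partition file above, but the gap between sequential and random data should be similarly stark.

package main

import (
	"bytes"
	"compress/gzip"
	"encoding/binary"
	"fmt"
	"math"
	"math/rand"
)

// gzipSize compresses b in memory and returns the compressed length in bytes.
func gzipSize(b []byte) int {
	var buf bytes.Buffer
	zw := gzip.NewWriter(&buf)
	_, _ = zw.Write(b)
	_ = zw.Close()
	return buf.Len()
}

func main() {
	const n = 10000
	var seq, rnd []byte
	var tmp [8]byte
	for i := 0; i < n; i++ {
		// Ascending values, like the insert loop above.
		binary.LittleEndian.PutUint64(tmp[:], math.Float64bits(float64(i)))
		seq = append(seq, tmp[:]...)
		// Truly random values for comparison.
		binary.LittleEndian.PutUint64(tmp[:], math.Float64bits(rand.Float64()))
		rnd = append(rnd, tmp[:]...)
	}
	fmt.Printf("sequential: %d -> %d bytes\n", len(seq), gzipSize(seq))
	fmt.Printf("random:     %d -> %d bytes\n", len(rnd), gzipSize(rnd))
}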
Given that Go's compress/gzip package exposes streaming Writer and Reader types, I propose exploring their use in the reading and writing of data files.
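As a rough sketch of the direction (writePartition and readPartition are hypothetical helpers, not existing tstorage functions), the write path could wrap the partition file in a gzip.Writer and the read path in a gzip.Reader:

package disk

import (
	"compress/gzip"
	"io"
	"os"
)

// writePartition streams already-encoded partition bytes through gzip to disk.
func writePartition(path string, encoded io.Reader) error {
	f, err := os.Create(path)
	if err != nil {
		return err
	}
	defer f.Close()

	zw := gzip.NewWriter(f)
	if _, err := io.Copy(zw, encoded); err != nil {
		return err
	}
	return zw.Close() // Close flushes the remaining compressed data.
}

// readPartition hands the decompressed stream to a caller-supplied decoder.
func readPartition(path string, decode func(io.Reader) error) error {
	f, err := os.Open(path)
	if err != nil {
		return err
	}
	defer f.Close()

	zr, err := gzip.NewReader(f)
	if err != nil {
		return err
	}
	defer zr.Close()

	return decode(zr)
}

One tradeoff worth flagging up front: a gzip-compressed file can't be sliced at arbitrary byte offsets, so any read path that currently seeks or memory-maps into a partition would need to decompress from the start of the stream, or the format would need block-level compression boundaries.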