nakabonne/tstorage

Proposal: gzip partition data

whmeitzler opened this issue · 0 comments

Given a naive write of sequential data:

for i := int64(1); i < 10000; i++ {
	_ = storage.InsertRows([]tstorage.Row{{
		Metric:    "metric1",
		DataPoint: tstorage.DataPoint{Timestamp: 1600000000 + i, Value: float64(i)},
	}})
}
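(For context, the snippet assumes a disk-backed storage opened beforehand, roughly like this; I'm assuming tstorage's NewStorage with the WithDataPath option for on-disk partitions, with error handling abbreviated:)

storage, err := tstorage.NewStorage(
	tstorage.WithDataPath("./data"),
)
if err != nil {
	panic(err)
}
defer storage.Close()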

The resulting file contains long stretches of repeated bytes (mostly 0x00s). This is great input for gzip, which compresses by de-duplicating repeated byte sequences in the stream.

If I gzip the file:

$ gzip -k data && ls -alsh data*
 84K -rw-rw-r-- 1 wmeitzler wmeitzler  82K Sep 20 09:45 data
4.0K -rw-rw-r-- 1 wmeitzler wmeitzler 1.3K Sep 20 09:45 data.gz

I achieve 21x compression of the on-disk size (84K → 4.0K)!

Note that this only provides real value when long stretches of adjacent datapoints are similar. If I populate a file with truly random values rather than ascending ones, I can only compress it from 82KB to 80KB, while still paying CPU time for the attempt. I suspect, but have not validated, that a meaningful share of TSDB use cases generate exactly this kind of adjacent-similar series that would benefit from gzip compression; a sketch of the comparison follows.
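Here is a minimal, self-contained sketch of that comparison: gzip the raw bytes of an ascending series versus a random one, in memory. The little-endian float64 encoding is my assumption standing in for tstorage's actual on-disk layout, so the exact numbers will differ from the file sizes above.

package main

import (
	"bytes"
	"compress/gzip"
	"encoding/binary"
	"fmt"
	"math/rand"
)

// gzipSize compresses b in memory and returns the compressed length.
func gzipSize(b []byte) int {
	var buf bytes.Buffer
	zw := gzip.NewWriter(&buf)
	zw.Write(b)
	zw.Close()
	return buf.Len()
}

func main() {
	var ascending, random bytes.Buffer
	for i := int64(1); i < 10000; i++ {
		// Ascending values, as in the insert loop above.
		binary.Write(&ascending, binary.LittleEndian, float64(i))
		// Truly random values for comparison.
		binary.Write(&random, binary.LittleEndian, rand.Float64())
	}
	fmt.Printf("ascending: %d -> %d bytes compressed\n", ascending.Len(), gzipSize(ascending.Bytes()))
	fmt.Printf("random:    %d -> %d bytes compressed\n", random.Len(), gzipSize(random.Bytes()))
}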

Given that Go's compress/gzip package exposes compression and decompression as streaming readers and writers, I propose exploring their use in the reading and writing of data files.
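A minimal sketch of how that could look, assuming the partition flush and load paths can be handed an io.Writer/io.Reader. The function names (writeCompressed, readCompressed) and the "data" path are hypothetical, not tstorage's API; the point is that gzip.Writer and gzip.Reader both stream, so neither path needs to buffer the whole uncompressed file in memory.

package main

import (
	"compress/gzip"
	"io"
	"log"
	"os"
)

// writeCompressed streams src through gzip into the data file at path.
func writeCompressed(path string, src io.Reader) error {
	f, err := os.Create(path)
	if err != nil {
		return err
	}
	defer f.Close()

	zw := gzip.NewWriter(f)
	if _, err := io.Copy(zw, src); err != nil {
		return err
	}
	// Close flushes buffered data and writes the gzip footer.
	return zw.Close()
}

// gzipFile bundles the decompressor with its underlying file so that
// Close releases both.
type gzipFile struct {
	*gzip.Reader
	file *os.File
}

func (g *gzipFile) Close() error {
	zerr := g.Reader.Close()
	ferr := g.file.Close()
	if zerr != nil {
		return zerr
	}
	return ferr
}

// readCompressed returns a streaming reader that decompresses the data
// file on the fly; callers read plain bytes exactly as before.
func readCompressed(path string) (io.ReadCloser, error) {
	f, err := os.Open(path)
	if err != nil {
		return nil, err
	}
	zr, err := gzip.NewReader(f)
	if err != nil {
		f.Close()
		return nil, err
	}
	return &gzipFile{Reader: zr, file: f}, nil
}

func main() {
	// Hypothetical usage: compress an existing plain data file.
	in, err := os.Open("data")
	if err != nil {
		log.Fatal(err)
	}
	defer in.Close()
	if err := writeCompressed("data.gz", in); err != nil {
		log.Fatal(err)
	}
}

One design caveat: a gzip stream can't be seeked into without decompressing from the start, so this fits a model where each partition's data file is written and read sequentially as a whole; if partitions are accessed randomly (e.g. via mmap), per-block compression would be needed instead.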