backtrace-labs/verneuil

Somewhat large manifest protos for > 1 GB databases

pkhuong opened this issue · 1 comment

We currently refer to each 64 KB page in a database with a 16-byte fingerprint. That's a good 4000x reduction, but still means a 256 KB incompressible manifest for a 1 GB db file.

Most DB writes don't actually change that many pages, so we should be able to expose that redundancy to the compressor. We can do that by stashing a base list of fingerprints (as raw bytes in a content-addressed chunk) and XOR-ing it with the list of fingerprints before serialisation and after deserialisation. This should leave long runs of zeros for zstd to shrink.
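Something like this minimal sketch, assuming fingerprints are materialised as 16-byte arrays (the helper name and signature are made up for illustration, not verneuil's actual API):

```rust
/// Sketch of the XOR-with-base idea: pages that did not change since the base
/// chunk turn into all-zero entries, which zstd then compresses almost to
/// nothing. XOR-ing again with the same base restores the original list, so
/// the same function works on both the encode (before serialisation) and
/// decode (after deserialisation) paths.
fn xor_with_base(fprints: &mut [[u8; 16]], base: &[[u8; 16]]) {
    for (fp, base_fp) in fprints.iter_mut().zip(base.iter()) {
        for (byte, base_byte) in fp.iter_mut().zip(base_fp.iter()) {
            *byte ^= *base_byte;
        }
    }
    // Fingerprints past the end of the base list (the db grew) stay untouched.
}
```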

In order to achieve this:

  1. Decode manifest protos with an optional base chunk
  2. Encode manifest protos with a base chunk
  3. Try to decompress manifest blobs when they look zstd-compressed (see the sketch after this list)
  4. Compress manifest blobs
  5. Avoid re-uploading chunks back-to-back in copier.rs: already handled! (RecentWorkSet)
  6. Figure out a policy to reset the base chunk
  7. Make sure not to use base chunks for dbs < a certain size
  8. Stash the latest Arc<Chunk> for each db somewhere: this guarantees we keep them in the global cache.
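
Steps 3 and 4 could look roughly like the sketch below. It assumes the zstd crate's decode_all/encode_all helpers; the function names are hypothetical, and the prefix check uses zstd's frame magic number (0xFD2FB528, little-endian on the wire):

```rust
/// Little-endian byte encoding of the zstd frame magic number 0xFD2FB528.
const ZSTD_MAGIC: [u8; 4] = [0x28, 0xB5, 0x2F, 0xFD];

/// Step 3 (sketch): only try to decompress blobs that look zstd-compressed,
/// and fall back to the raw bytes otherwise (or if decompression fails), so
/// old uncompressed manifests keep working.
fn maybe_decompress(blob: &[u8]) -> Vec<u8> {
    if blob.starts_with(&ZSTD_MAGIC) {
        if let Ok(decoded) = zstd::decode_all(blob) {
            return decoded;
        }
    }
    blob.to_vec()
}

/// Step 4 (sketch): compress the serialised manifest proto before upload.
fn compress_manifest(raw: &[u8]) -> std::io::Result<Vec<u8>> {
    // Level 0 asks zstd for its default compression level.
    zstd::encode_all(raw, 0)
}
```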

Quick tests that always create a fresh base chunk show zstd has no problem compressing our runs of zeros. In fact, incompressible data shows a static overhead of 13 bytes for our manifests, so compression even seems to be able to shave a few bytes off the rest of the proto payload.