google/cdc-file-transfer

fastcdc implementation sets chunk boundaries before last gear-hashed byte.

dbaarda opened this issue · 4 comments

The fastcdc implementation adds data[i] to the gear rollsum, checks if the hash meets the chunk boundary criteria, and then returns i as the chunk length. Since data is zero-indexed, this means the last byte included in the rollsum that met the boundary criteria is not included at the end of the chunk, but ends up at the start of the next chunk.
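To make the off-by-one concrete, here's a minimal sketch of the boundary search (names like FindChunkLength and kGearTable are hypothetical, and min/max size handling is omitted; this isn't the repository's actual code). Since data[i] is already mixed into the gear hash when the mask test passes, the cut length should be i + 1 rather than i:

```cpp
#include <cstddef>
#include <cstdint>

extern const uint64_t kGearTable[256];  // assumed 256-entry gear lookup table

// Returns the length of the next chunk in data[0..size).
size_t FindChunkLength(const uint8_t* data, size_t size, uint64_t mask) {
  uint64_t hash = 0;
  for (size_t i = 0; i < size; ++i) {
    hash = (hash << 1) + kGearTable[data[i]];  // data[i] is now part of the hash
    if ((hash & mask) == 0) {
      // The byte that triggered the boundary is data[i], so the chunk must
      // span i + 1 bytes. Returning i here pushes that byte into the next
      // chunk, which is the bug described above.
      return i + 1;
    }
  }
  return size;  // no boundary found; the rest of the input is one chunk
}
```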

This has an interesting degenerate case: if you take all the identified chunks for a file and rearrange them into a different order, cdc-file-transfer will find nearly zero duplicate chunks, even though every single chunk is duplicated. The only duplicates it will find are those where the byte after the chunk happens to be the same in both files.

In practice this probably rarely happens, as changes would rarely align exactly on the edge of chunk boundaries, but it would be a trivial fix to avoid this degenerate case.

I don't know how important preserving the chunking behaviour is for backwards compatibility... are "signatures" of the chunk lengths/hashes stored and used later, possibly by different versions?

I can easily put together a pull request to fix this if backwards-compatibility is not a problem.

Nice catch, thanks! That's a bug indeed. Another issue with this: the input can never end exactly on a chunk boundary that is > cfg_.min_size. The gear hash is never stored; the chunks are identified based on their BLAKE3 hash, so there is no danger of corruption when fixing this. The only mildly negative effect is that for cdc_stream, the cache will be invalidated because all chunk boundaries change, but it's not a big deal.

Feel free to send me a pull request. If not, now that I look at the code again after a long time, I see a couple of other issues that I'd like to address, so I could also fix that bug myself.

Note that you need to accept the Google Contributor's License Agreement before we can approve your contribution:
https://cla.developers.google.com/

I have a few other things related to the fastcdc/gear implementation that I'll send a pull request for soon, but this particular bug/fix is tiny so I'll probably send a separate fix for it first.

The other stuff is related to my research https://github.com/dbaarda/rollsum-chunking/blob/master/RESULTS.rst which showed that the "normalization" FastCDC does is actually worse than simple exponential chunking, and I see that you normalize even more strongly than FastCDC did.

It turns out the deduplication wins FastCDC reported for normalized chunking were an artefact of a smaller average chunk size. Any "normalizing" of the chunk-size distribution actually hurts deduplication because it makes the chunk boundaries more dependent on the previous chunk boundary, which messes with the re-synchronisation after insert/delete deltas. If you use the same average chunk size and choose your min/max settings right, a simple exponential chunker gets better deduplication.
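For illustration, this is roughly what I mean by a simple exponential chunker (a hypothetical helper, not code from this repository): skip to min_size, then apply one fixed mask until max_size, so the cut point depends only on the local data rather than on a position-dependent mask schedule:

```cpp
#include <cstddef>
#include <cstdint>

extern const uint64_t kGearTable[256];  // assumed 256-entry gear lookup table

// Chunk lengths beyond min_size follow a roughly exponential distribution.
size_t ExponentialChunkLength(const uint8_t* data, size_t size,
                              size_t min_size, size_t max_size,
                              uint64_t mask) {
  uint64_t hash = 0;
  const size_t limit = size < max_size ? size : max_size;
  for (size_t i = 0; i < limit; ++i) {
    hash = (hash << 1) + kGearTable[data[i]];
    // Only start testing once the minimum chunk size is reached.
    if (i + 1 >= min_size && (hash & mask) == 0) {
      return i + 1;
    }
  }
  return limit;  // hard cut at max_size (or end of input)
}
```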

I was going to file another bug related to that, but I thought I'd whip up some code and test it all first.

BTW, I also work for Google, but stumbled onto this during my holidays doing hobby non-work stuff. I dunno if I technically needed to, but I signed the CLA using my personal gmail/github account details.

Interesting, I might take a look when I find some time.

Note that for our use-case, deduplication was not the only requirement we had. First and foremost, I implemented FastCDC for cdc_stream to support playing a game without uploading it to the cloud first. The goal was to utilize the bandwidth well and reduce the number of round-trips. This works best with fixed chunk sizes as that gives predictable latencies.

My implementation of FastCDC is a compromise between predictable chunk sizes and deduplication. That's why I introduced multiple stages and bit masks to get a stronger normalization than the original algorithm proposed. The application of FastCDC in cdc_sync came as an afterthought. We never optimized the implementation for the rsync use-case, though we probably could just use fewer mask_stages for cdc_sync to reduce the normalization.
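Roughly, the structure looks something like the sketch below (simplified, with made-up stage offsets and mask values rather than the actual cdc_stream code): each stage applies a progressively weaker mask, so very short chunks are unlikely and overly long chunks get cut quickly, which tightens the size distribution around the target:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

extern const uint64_t kGearTable[256];  // assumed 256-entry gear lookup table

struct MaskStage {
  size_t start;   // offset from which this mask applies
  uint64_t mask;  // fewer set bits => a boundary becomes more likely
};

size_t NormalizedChunkLength(const uint8_t* data, size_t size, size_t max_size,
                             const std::vector<MaskStage>& stages) {
  uint64_t hash = 0;
  const size_t limit = size < max_size ? size : max_size;
  size_t stage = 0;
  for (size_t i = 0; i < limit; ++i) {
    hash = (hash << 1) + kGearTable[data[i]];
    // Switch to a weaker mask once the offset passes the next stage's start.
    while (stage + 1 < stages.size() && i + 1 >= stages[stage + 1].start) {
      ++stage;
    }
    if (i + 1 >= stages[stage].start && (hash & stages[stage].mask) == 0) {
      return i + 1;
    }
  }
  return limit;  // hard cut at max_size (or end of input)
}
```

With, say, a few stages whose masks drop from 16 down to 8 set bits, most chunks land close to the target size, which is what gives the predictable latencies mentioned above.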

Ahh... interesting, so the normal distribution of chunk sizes was an explicit objective/requirement.

Note in my testing I came up with an even "better chunk normalizer" that gives an almost perfect normal distribution (it's actually a Weibull distribution) and is much simpler to implement than FastCDC's "NC" normalizer. It was my first experiment to improve on FastCDC, and I was surprised when it had even worse deduplication. This led me to test a whole bunch of things that clearly showed any sort of normalization makes deduplication worse.

If your requirements are tightly constrained chunk sizes, then "Regression Chunking" might be worth using. It improves deduplication when you have a small max_size, at the cost of some reduced speed benefits from cut-point-skipping. It hardly sacrifices any deduplication with max_size = 2×avg_size, and even works OK with max_size = 1.5×avg_size.
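In case it helps, here's a rough sketch of how I understand Regression Chunking (hypothetical names, simplified, not taken from any particular implementation): remember the most recent offset that satisfied each progressively weaker mask, and if max_size is reached without a hit on the strictest mask, cut at the strongest remembered fallback instead of doing a hard cut:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

extern const uint64_t kGearTable[256];  // assumed 256-entry gear lookup table

// masks[0] is the strictest mask; later entries have fewer set bits.
size_t RegressionChunkLength(const uint8_t* data, size_t size,
                             size_t min_size, size_t max_size,
                             const std::vector<uint64_t>& masks) {
  std::vector<size_t> fallback(masks.size(), 0);  // 0 = no match seen yet
  uint64_t hash = 0;
  const size_t limit = size < max_size ? size : max_size;
  for (size_t i = 0; i < limit; ++i) {
    hash = (hash << 1) + kGearTable[data[i]];
    if (i + 1 < min_size) continue;
    if ((hash & masks[0]) == 0) return i + 1;  // normal boundary
    // Record positions that satisfy the weaker masks as fallback cut points.
    for (size_t m = 1; m < masks.size(); ++m) {
      if ((hash & masks[m]) == 0) fallback[m] = i + 1;
    }
  }
  if (limit < max_size) return limit;  // ran out of input
  // max_size reached: prefer the strongest fallback that was seen.
  for (size_t m = 1; m < masks.size(); ++m) {
    if (fallback[m] != 0) return fallback[m];
  }
  return limit;  // no fallback matched; hard cut at max_size
}
```

The fallback boundaries are still content-defined, so they tend to reappear after insert/delete deltas, which is where the deduplication win over a plain hard cut at max_size comes from.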