AE does not yield expected chunk size
Closed this issue · 2 comments
We have noticed that the average chunk size yielded by AE deviates from the target chunk size, especially with higher target chunk sizes, and also on random data.
We suspect that the formula coded in ae_chunking.cpp#L24 and also presented in the original paper is wrong.
Take, for example, a very high target chunk size like 10 KB. According to the formula, the window size should be 5820 bytes. Assuming uniformly distributed data, the maximum byte value in this window is almost certainly going to be 255, and the next byte to match it would be expected after another 256 bytes, on average. This would then yield an average chunk size of about 6076 bytes (5820 + 256), far below the target.
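The estimate above can be checked with a small simulation. This is only a sketch, not the code from ae_chunking.cpp: the function names are hypothetical, the 10 KB target is taken as 10,000 bytes, and the window formula is assumed to be w = target / (e - 1), which reproduces the 5820 figure. The chunker is a plain byte-wise AE loop that moves the maximum only on a strictly greater byte and cuts w bytes after the last maximum.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Assumed formula: target = (e - 1) * w, rounded to the nearest byte.
std::size_t ae_window_size(double target_chunk_size) {
    return static_cast<std::size_t>(
        std::llround(target_chunk_size / (std::exp(1.0) - 1.0)));
}

// Byte-wise AE over `data`; returns the mean size of all complete chunks.
// The trailing partial chunk (no cut point found) is ignored.
double ae_mean_chunk_size(const std::vector<std::uint8_t>& data, std::size_t w) {
    assert(w > 0);
    std::size_t start = 0, chunks = 0, chunked_bytes = 0;
    while (start < data.size()) {
        std::uint8_t max_val = data[start];
        std::size_t max_pos = start;
        std::size_t i = start + 1;
        bool cut = false;
        for (; i < data.size(); ++i) {
            if (data[i] > max_val) {          // strictly greater: new maximum
                max_val = data[i];
                max_pos = i;
            } else if (i == max_pos + w) {    // w bytes after the maximum: cut
                cut = true;
                break;
            }
        }
        if (!cut) break;
        chunked_bytes += i - start + 1;       // chunk covers [start, i]
        ++chunks;
        start = i + 1;
    }
    return chunks ? static_cast<double>(chunked_bytes) / chunks : 0.0;
}

// Uniformly distributed bytes from a fixed seed, for repeatable runs.
std::vector<std::uint8_t> random_bytes(std::size_t n, std::uint32_t seed) {
    std::mt19937 rng(seed);
    std::uniform_int_distribution<int> dist(0, 255);
    std::vector<std::uint8_t> out(n);
    for (auto& b : out) b = static_cast<std::uint8_t>(dist(rng));
    return out;
}
```

Calling `ae_mean_chunk_size(random_bytes(1u << 23, 42), ae_window_size(10000.0))` should land close to the window size plus roughly 256 bytes rather than near the 10,000-byte target, if the reasoning above holds.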
We briefly looked into this and believe that it's related to the maximum value, as you pointed out. We noticed a fix within DeStor's AE implementation, where they compare 8 bytes at a time instead of 1 to find the maximum value. This avoids the issue by increasing the value space while providing similar throughput.
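To illustrate the idea of comparing 8 bytes at a time, here is a minimal sketch. It is not DeStor's code: it assumes the value at each position is the 8 bytes starting there, read as a big-endian 64-bit integer, and `load_be64` and `ae_cut_u64` are hypothetical names. With a 64-bit value space, ties with the current maximum become very unlikely on random data, so the maximum keeps moving instead of sticking at the first 255.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Read 8 bytes at p as a big-endian integer (first byte most significant).
static std::uint64_t load_be64(const std::uint8_t* p) {
    std::uint64_t v = 0;
    for (int k = 0; k < 8; ++k) v = (v << 8) | p[k];
    return v;
}

// Length of the first AE chunk in data[0..n), comparing 8-byte values
// instead of single bytes. Returns n if no cut point is found.
std::size_t ae_cut_u64(const std::uint8_t* data, std::size_t n, std::size_t w) {
    assert(w > 0);
    if (n < 8) return n;                       // too short to form one value
    std::uint64_t max_val = load_be64(data);
    std::size_t max_pos = 0;
    for (std::size_t i = 1; i + 8 <= n; ++i) {
        const std::uint64_t v = load_be64(data + i);
        if (v > max_val) {                     // strictly greater: new maximum
            max_val = v;
            max_pos = i;
        } else if (i == max_pos + w) {         // w positions after the maximum
            return i + 1;                      // chunk covers [0, i]
        }
    }
    return n;
}
```

For example, with the bytes `FF 00 FF 01 00 ...` and `w = 3`, a byte-wise comparison keeps the maximum at position 0 because the second `FF` only ties, whereas the 8-byte comparison moves the maximum to position 2 because `FF 01 ...` is greater than `FF 00 ...`.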
We will probably push a similar fix when we push a newer version of the code that matches our recent IEEE CLOUD paper.
For now, we've changed the AE window size to avg_chunk_size - 256, which yields average chunk sizes closer to the expected values. Other systems have used this fix previously as well.
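As a rough sketch of that workaround (the helper names are hypothetical, and clamping the window to 1 for targets of 256 bytes or less is an assumption of this example, not something stated in the thread):

```cpp
#include <cassert>
#include <cstddef>

// Workaround described above: window = avg_chunk_size - 256.
// Clamping to 1 for tiny targets is an assumption of this sketch.
std::size_t ae_window_workaround(std::size_t avg_chunk_size) {
    return avg_chunk_size > 256 ? avg_chunk_size - 256 : 1;
}

// Rough estimate of the mean chunk size for byte-wise AE on uniformly
// random bytes, following the reasoning in the issue: the window plus
// about 256 bytes. Only meaningful when w is much larger than 256.
std::size_t estimated_mean_chunk(std::size_t w) {
    assert(w > 0);
    return w + 256;
}
```

For a 10,000-byte target this gives a window of 9744 bytes, and the same back-of-the-envelope estimate from the issue then comes back to about 10,000 bytes on random data. On real data with less uniform byte values the result may differ.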