prestodb/airlift

SparseHll inserts incorrect value for high number of leading zeros

jonhehir opened this issue · 0 comments

This is an edge case that may not be worth fixing.

SparseHll has a small bug that causes the wrong bucket value to be recorded when the number of leading zeros is within 6 of the maximum possible number. For example, with 4096 buckets (indexBitLength = 12), the maximum possible bucket value is 64 - indexBitLength + 1 = 53 (52 leading zeros). Values up to 46 are correctly recorded, but attempting to insert a hash with more 46 or more leading zeros records an incorrect value. This is reflected in the added unit tests in #55.

I have not thoroughly investigated why this is, but I believe the implementation of HyperLogLog++ in SparseHll differs slightly from that laid out in the original paper. (See in particular the encoding and decoding of hashes in the sparse setting, and note that 6 is equal to VALUE_BITS in SparseHll.) I suspect fixing this bug would require making breaking changes to the format of SparseHll. Meanwhile, the probability of this bug occurring is very remote, so its effects are virtually nil. Nonetheless, I figured it would be worthwhile to document this behavior for any future reference.