SparseHll inserts incorrect value for high number of leading zeros
jonhehir opened this issue · 0 comments
This is an edge case that may not be worth fixing.
SparseHll
has a small bug that causes the wrong bucket value to be recorded when the number of leading zeros is within 6 of the maximum possible number. For example, with 4096 buckets (indexBitLength = 12
), the maximum possible bucket value is 64 - indexBitLength + 1 = 53
(52 leading zeros). Values up to 46 are correctly recorded, but attempting to insert a hash with more 46 or more leading zeros records an incorrect value. This is reflected in the added unit tests in #55.
I have not thoroughly investigated why this is, but I believe the implementation of HyperLogLog++ in SparseHll
differs slightly from that laid out in the original paper. (See in particular the encoding and decoding of hashes in the sparse setting, and note that 6 is equal to VALUE_BITS
in SparseHll
.) I suspect fixing this bug would require making breaking changes to the format of SparseHll
. Meanwhile, the probability of this bug occurring is very remote, so its effects are virtually nil. Nonetheless, I figured it would be worthwhile to document this behavior for any future reference.