Weird results post-merge
mediocregopher opened this issue · 2 comments
I know hll's aren't supposed to be accurate, so if this is simply a case of finding a limitation in the accuracy then that's cool, I just wanted to double check. I have the following code:
func main() {
add := func(hll *hyperloglog.HyperLogLogPlus, i int) {
h := fnv.New64a()
h.Write([]byte(strconv.Itoa(i)))
hll.Add(h)
}
a, _ := hyperloglog.NewPlus(12)
for i := 0; i < 10000; i++ {
add(a, i%100)
}
fmt.Printf("a: %d\n", a.Count())
b, _ := hyperloglog.NewPlus(12)
for i := 0; i < 10000; i++ {
add(b, (i%100)+5)
}
fmt.Printf("b: %d\n", b.Count())
a.Merge(b)
fmt.Printf("a+b: %d\n", a.Count())
}
I get the output:
a: 100
b: 100
a+b: 6
So the Counts for a
and b
individually look good, but post-merge it seems to get borked. Again, just wanted to check if this is a bug vs a limitation in what hll's can actually do. Thanks!
So one issue is that fnv.New64a does not provide very well distributed hashes for the integers 0-100:
func main() {
for i := 0; i < 100; i++ {
h := fnv.New64a()
h.Write([]byte(strconv.Itoa(i)))
fmt.Printf("%x\n", h.Sum64())
}
}
prints:
af63ad4c86019caf
af63ac4c86019afc
af63af4c8601a015
af63ae4c86019e62
af63a94c860195e3
af63a84c86019430
af63ab4c86019949
af63aa4c86019796
af63b54c8601aa47
af63b44c8601a894
...
Which really screws with the accuracy of hyperloglog. Maybe try a Murmur3 hash or something similar, it might give better results. I tried it out with https://github.com/spaolacci/murmur3 and it gave much more accurate results (a+b: 105).
In addition to this, I just merged #17, which will keep the sparse representation when merging two small HLLs. This will also increase accuracy.
Ah, thanks! I had come up with that test case as a generalization of something else I was doing that didn't involve integers I don't think, but if I see this problem again I'll keep murmer3 in mind, that might be worth switching to in general anyway.