clarkduvall/hyperloglog

Weird results post-merge

mediocregopher opened this issue · 2 comments

I know hll's aren't supposed to be accurate, so if this is simply a case of finding a limitation in the accuracy then that's cool, I just wanted to double check. I have the following code:

func main() {
	add := func(hll *hyperloglog.HyperLogLogPlus, i int) {
		h := fnv.New64a()
		h.Write([]byte(strconv.Itoa(i)))
		hll.Add(h)
	}

	a, _ := hyperloglog.NewPlus(12)
	for i := 0; i < 10000; i++ {
		add(a, i%100)
	}
	fmt.Printf("a: %d\n", a.Count())

	b, _ := hyperloglog.NewPlus(12)
	for i := 0; i < 10000; i++ {
		add(b, (i%100)+5)
	}
	fmt.Printf("b: %d\n", b.Count())

	a.Merge(b)
	fmt.Printf("a+b: %d\n", a.Count())
}

I get the output:

a: 100
b: 100
a+b: 6

So the Counts for a and b individually look good, but post-merge it seems to get borked. Again, just wanted to check if this is a bug vs a limitation in what hll's can actually do. Thanks!

So one issue is that fnv.New64a does not provide very well distributed hashes for the integers 0-100:

func main() {
        for i := 0; i < 100; i++ {
                h := fnv.New64a()
                h.Write([]byte(strconv.Itoa(i)))
                fmt.Printf("%x\n", h.Sum64())
        }
}

prints:

af63ad4c86019caf
af63ac4c86019afc
af63af4c8601a015
af63ae4c86019e62
af63a94c860195e3
af63a84c86019430
af63ab4c86019949
af63aa4c86019796
af63b54c8601aa47
af63b44c8601a894
...

Which really screws with the accuracy of hyperloglog. Maybe try a Murmur3 hash or something similar, it might give better results. I tried it out with https://github.com/spaolacci/murmur3 and it gave much more accurate results (a+b: 105).

In addition to this, I just merged #17, which will keep the sparse representation when merging two small HLLs. This will also increase accuracy.

Ah, thanks! I had come up with that test case as a generalization of something else I was doing that didn't involve integers I don't think, but if I see this problem again I'll keep murmer3 in mind, that might be worth switching to in general anyway.