clarkduvall/hyperloglog

Large errors hll++ at high cardinality

nieksand opened this issue · 5 comments

Once I set my cardinality to 100k using hll++, the error rate becomes massive:

approx:  10506
actual:  100000
delta:   89494

My sample code is pretty much cribbed from your test/bench code. Set the truth variable to 1000 and you get exact result. Things look good in the 10k range too. But at 100k... boom.

Complete repro code:

package main

import (
    "github.com/clarkduvall/hyperloglog"
    "fmt"
    "hash/fnv"
    "hash"
)

func hash64(s string) hash.Hash64 {
    h := fnv.New64()
    h.Write([]byte(s))
    return h
}

func main() {
    truth := 100000

    h, err := hyperloglog.NewPlus(16);
    if err != nil {
        fmt.Println(err)
        return
    }

    for i := 0; i < truth; i++ {
            istr := fmt.Sprintf("%s", i)
            h.Add(hash64(istr))
    }
    epp := h.Count()

    fmt.Println("approx: ", epp)
    fmt.Println("actual: ", truth)
    fmt.Println("delta:  ", int64(truth)-int64(h.Count()))
}

Nice find, I'll take a look. Thanks for the repro code!

When debugging this, I found a strangely large number of similar hashes. If you change the hashing line from:

istr := fmt.Sprintf("%s", i)

to

istr := fmt.Sprintf("%s", i*384174)

the results are much better (within a couple hundred). I'm wondering if the fnv hash doesn't handle similar values very well for HLL purposes.

On another note, the reason it explodes after a certain number is because it switches from the sparse representation to the normal representation. The sparse representation doesn't require the hash values to be as well distributed, so the problem doesn't show up until the normal representation is used. Looks like the crossover point is around 59040.

I think you're right. It's the hashing function.

I switched to fnv64a:

    h := fnv.New64a()

and results look fine:

approx:  103611
actual:  100000
delta:   -3611

Apparently that variant has better avalanche properties: http://en.wikipedia.org/wiki/Fowler%E2%80%93Noll%E2%80%93Vo_hash_function#FNV-1a_hash

I think this can be closed. It may be worth-while to update the examples/benchmarks to use the a variant of fnv instead.

I popped a PR updating benchmark so that nobody else makes the same mistake I did:

#11