/go-hll

HyperLogLog in golang

Primary LanguageGoApache License 2.0Apache-2.0

HyperLogLog in golang. Docs. Build Status codecov

What

A go implementation of HypeLogLog data structure with a twist. See HyperLogLog in Practice paper by Stefan Heule, Marc Nunkesser, Alex Hall.

Ø-serialization

There is no need to serialize/deserialize hll. Everything is stored in a byte slice, which can be memory mapped, passed around over the network as is etc.

Differences from the paper:

  • sparse representation. this implementation does exact counting for small sets.
  • fixed memory usage (even for empty HLL). HLL of a given precision P uses fixed (8 + 3*2^(P-2), 8 byte header + 6 bits per register) size in bytes.
  • thresholds are tuned. different from Sub-Algorithm Threshold.

Why

I wanted an HLL implementation that is

  • simple
  • reasonably fast
  • (almost) non-allocating
  • exact when number of unique elements is small
  • memory-mapped file friendly
  • well tested (90+% coverage)

Usage

Get go-hll:

go get github.com/sasha-s/go-hll

Use it:

s, err := SizeByP(16)
if err != nil {
	log.Panicln(err)
}
h := make(HLL, s)
...
for _, x := range []string{"alpha", "beta"} {
	h.Add(siphash.Hash(2, 57, []byte(x)))
}
log.Println(h.EstimateCardinality())

Use good hash (otherwise accuracy would be poor). Some options:

Speed

Benchmark results on my MacBook Pro (Mid 2014).

Add-8            9.68ns ± 1%
Estimate-8       27.3µs ± 1%
Merge-8          38.0µs ± 1%

AddDense-8       6.73ns ± 3%
MergeDense-8     37.9µs ± 1%
EstimateDense-8  22.9µs ± 1%
Sort-8            108µs ± 1%
AddSparse-8      10.2ns ± 3%

Merge/Estimate etc. are done for P=14.

Other implementations (in no particular order)