/countmemaybe

A set of distinct value estimators that give probabilistic bounds on a sets cardinality

Primary LanguagePython

Count me Maybe

Hey I just met you, This is crazy, But here's a number, So count me, maybe?

countmemaybe is a set of probabilistic data structures that can estimate the cardinality of a set and also support unions. The base set of operations are:

  • Cardinality
  • Union
  • Cardinality of the Union
  • Cardinality of the Intersection
  • Jaccard Index

Insertions are quite quick! For both hyperloglog and kminvalues there is an average insertion time of 4us on a 1.8GHz i7 Macbook Air.

Below are some benchmarks of error rates. The set used for estimation are 160000 random 16bit integers and the benchmark was generated using the test_dve.py script. For some as of yet unknown reason, KMV with a 64bit hash function performs worse than a 32bit hash function. This is counterintuitive and requires more investigation. Furthermore, exact mean relative error rates vary from run to run by ~1%

Benchmarks