Besides being a compact and pretty speedy HyperLogLog implementation for cardinality counting, this modified HyperLogLog allows intersection and similarity estimation of different HyperLogLogs.
A simple implementation of HyperLogLog (LogLog-Beta to be specific):
- 16 bit registers instead of 6 bit, the new 10 bit are for b-bit signatures
- Similarity function estimates Jaccard indices (a number between 0-1) of 0.01 for set cardinalities on the order of 1e9 with accuracy around 5%
- Intersection applies the Jaccard index on the union of the sets to return the intersecting set cardinality
The work is based on "HyperMinHash: Jaccard index sketching in LogLog space - Yun William Yu, Griffin M. Weber"
sk1 := hyperminhash.New()
sk2 := hyperminhash.New()
for i := 0; i < 10000; i++ {
sk1.Add([]byte(strconv.Itoa(i)))
}
sk1.Cardinality() // 10001 (should be 10000)
for i := 3333; i < 23333; i++ {
sk2.Add([]byte(strconv.Itoa(i)))
}
sk2.Cardinality() // 19977 (should be 20000)
sk1.Similarity(sk2) // 0.284589082 (should be 0.2857326533)
sk1.Intersection(sk2) // 6623 (should be 6667)
sk1.Merge(sk2)
sk1.Cardinality() // 23271 (should be 23333)
Set1 |
HLL1 |
Set2 |
HLL2 |
S1 ∪ S2 |
HLL1 ∪ HLL2 |
S1 ∩ S2 |
HLL1 ∩ HLL2 |
350 |
352 |
752 |
752 |
831 |
832 |
271 (32.611312%) |
274 (32.932692%) |
746 |
748 |
591 |
590 |
834 |
835 |
503 (60.311751%) |
501 (60.000000%) |
248 |
248 |
789 |
791 |
897 |
899 |
140 (15.607581%) |
144 (16.017798%) |
9 |
9 |
818 |
818 |
824 |
825 |
3 (0.364078%) |
3 (0.363636%) |
408 |
411 |
412 |
408 |
771 |
771 |
49 (6.355383%) |
47 (6.095979%) |
Set1 |
HLL1 |
Set2 |
HLL2 |
S1 ∪ S2 |
HLL1 ∪ HLL2 |
S1 ∩ S2 |
HLL1 ∩ HLL2 |
2126 |
2138 |
1162 |
1158 |
3063 |
3060 |
225 (7.345739%) |
223 (7.287582%) |
7767 |
7706 |
7054 |
7064 |
8889 |
8887 |
5932 (66.734166%) |
5888 (66.254079%) |
842 |
844 |
5183 |
5135 |
5880 |
5842 |
145 (2.465986%) |
135 (2.310852%) |
6833 |
6791 |
664 |
666 |
7410 |
7345 |
87 (1.174089%) |
89 (1.211709%) |
1814 |
1820 |
6214 |
6169 |
7697 |
7639 |
331 (4.300377%) |
320 (4.189030%) |
Set1 |
HLL1 |
Set2 |
HLL2 |
S1 ∪ S2 |
HLL1 ∪ HLL2 |
S1 ∩ S2 |
HLL1 ∩ HLL2 |
29667 |
29540 |
88700 |
88167 |
92444 |
91667 |
25923 (28.041842%) |
25036 (27.311901%) |
79242 |
78731 |
30216 |
30137 |
83502 |
82953 |
25956 (31.084285%) |
25995 (31.337022%) |
57830 |
57223 |
79550 |
79194 |
82112 |
81595 |
55268 (67.308067%) |
54684 (67.018812%) |
64610 |
63501 |
21696 |
21729 |
75895 |
74816 |
10411 (13.717636%) |
10083 (13.477064%) |
92204 |
91453 |
96417 |
95556 |
165025 |
163370 |
23596 (14.298440%) |
24130 (14.770154%) |
Set1 |
HLL1 |
Set2 |
HLL2 |
S1 ∪ S2 |
HLL1 ∪ HLL2 |
S1 ∩ S2 |
HLL1 ∩ HLL2 |
150443 |
149810 |
974366 |
979514 |
1088517 |
1096991 |
36292 (3.334077%) |
37417 (3.410876%) |
156337 |
155347 |
19083 |
19070 |
167353 |
165433 |
8067 (4.820350%) |
8017 (4.846071%) |
800969 |
802044 |
51053 |
51244 |
851388 |
853396 |
634 (0.074467%) |
511 (0.059878%) |
176155 |
174707 |
520111 |
516822 |
570092 |
569289 |
126174 (22.132217%) |
123766 (21.740452%) |
485954 |
481362 |
967341 |
972651 |
1081990 |
1091296 |
371305 (34.316861%) |
376007 (34.455088%) |
Set1 |
HLL1 |
Set2 |
HLL2 |
S1 ∪ S2 |
HLL1 ∪ HLL2 |
S1 ∩ S2 |
HLL1 ∩ HLL2 |
7132942 |
7150720 |
122116 |
121539 |
7243153 |
7261709 |
11905 (0.164362%) |
12550 (0.172824%) |
8646240 |
8649049 |
1277784 |
1295017 |
9821480 |
9854242 |
102544 (1.044079%) |
99163 (1.006298%) |
4192390 |
4164637 |
2788913 |
2779975 |
4526476 |
4499897 |
2454827 (54.232630%) |
2454356 (54.542493%) |
9803344 |
9826412 |
1705700 |
1715798 |
10255010 |
10262719 |
1254034 (12.228501%) |
1273821 (12.412120%) |
1308849 |
1322604 |
9940327 |
9971519 |
11179030 |
11201850 |
70146 (0.627478%) |
80717 (0.720568%) |
Set1 |
HLL1 |
Set2 |
HLL2 |
S1 ∪ S2 |
HLL1 ∪ HLL2 |
S1 ∩ S2 |
HLL1 ∩ HLL2 |
13237748 |
13298469 |
57073758 |
57124720 |
59474437 |
59394847 |
10837069 (18.221390%) |
11143669 (18.762013%) |
90757994 |
88576114 |
5717797 |
5701796 |
95061178 |
93016636 |
1414613 (1.488108%) |
1350058 (1.451416%) |
60150663 |
60033013 |
79238333 |
77672994 |
110438475 |
108311818 |
28950521 (26.214162%) |
27666946 (25.543792%) |
30187492 |
30718889 |
37756209 |
37153655 |
67443566 |
66938074 |
500135 (0.741561%) |
447406 (0.668388%) |
53196095 |
53461989 |
48344583 |
47535284 |
93284291 |
91321031 |
8256387 (8.850780%) |
8036467 (8.800237%) |