/minsketch

Several implementations of count-min sketches and variants thereof.

Primary LanguagePython

minsketch

A flexible implementation of several min-sketch variants.

A count-min sketch is a probabilistic counting data structure - to save space, we allow for some known probability of counting error. This allows to summarize a data stream with limited space, without growing indefinitely as we see more and more data.

For a longer description, and some of the relevant papers, see: https://sites.google.com/site/countminsketch/

Here's an example usage:

import minsketch
from functools import partial
from timeit import timeit

# Delta and epsilon and uncertainty measures
# Count-min sketches are within epsilon of the true value 
# With probability 1 - delta
delta = 10 ** -5
epsilon = 0.001

# This implementation supports several different backing table classes
# For most purposes, the Python array makes for a good default:
table_class = minsketch.sketch_tables.ArrayBackedSketchTable

# If you are willing to only support positive updates (no deletions)
# You should use conservative updates, as significantly reduces error:
update_strategy = minsketch.update_strategy.ConservativeUpdateStrategy

# If you have a reason to believe you might want lossy updating, you
# should benchmark one of the lossy updating schemes on your data:
lossy_strategy = minsketch.lossy_strategy.NoLossyUpdateStrategy
 
# So far, the hash-pair based implementation appears to outperform
# the universal hash family based one:
sketch = minsketch.double_hashing.HashPairCMSketch(
    delta, epsilon, table_class=table_class,
    update_strategy=update_strategy, lossy_strategy=lossy_strategy)
    
# Update the sketch with some data
from numpy import random
data = random.randint(0, 1000, 100000)
print(timeit(partial(sketch.update, data), number=1))

# Query the ten most common elements:
print(sketch.most_common(10))

# For a performance boost, you can also use a Counter-Sketch Hybrid
hybrid = minsketch.counter_sketch_hybrid.SketchCounterHybrid(
    minsketch.double_hashing.HashPairCMSketch(
    delta, epsilon, table_class=table_class,
    update_strategy=update_strategy, lossy_strategy=lossy_strategy))

print(timeit(partial(hybrid.update, data), number=1))
print(hybrid.most_common(10))

Documentation

See the documentation at http://minsketch.readthedocs.io/en/latest/