This library contains a set of sketching or hash-based algorithms.
A bloom filter checks whether an element has already been seen given a configurable false positive rate.
>>> from streamingds.bloomfilter import BloomFilter
>>> capacity = 1000
>>> error_rate = 0.01
>>> bf = BloomFilter(capacity, error_rate)
>>> bf.add('test')
>>> 'test' in bf
True
>>> 'that' in bf
False
A count-min sketch estimates the number of times an element has been seen. This implementation also keeps track of the top k elements.
>>> from streamingds.countminsketch import CountMinSketch
>>> delta = 10 ** -7
>>> epsilon = 0.005
>>> topK = 50
>>> cms = CountMinSketch(delta, epsilon, topK)
>>> cms.update('www.google.com')
>>> cms.get('www.google.com')
1
>>> cms.update('www.google.com', 12)
>>> cms.get('www.google.com')
13
>>> cms.update('www.yahoo.com', 20)
>>> cms.get_ranking()
{0: (20, 'www.yahoo.com')
1: (13, 'www.google.com')}
A HyperLogLog estimates the number of distinct items.
>>> from random import sample
>>> from streamingds.hyperloglog import HyperLogLog
>>> hll = HyperLogLog(12)
>>> num_elements = 500000
>>> elements = sample(xrange(5000000), num_elements)
>>> hll.add(*elements)
>>> hll.cardinality()
495079.71125622035
MIT. See LICENSE