/python-cuckoo

Cuckoo Filter implementation that Python 3.x only.

Primary LanguagePythonMIT LicenseMIT

python-cuckoo

python-cuckoo is an implementation of a Cuckoo filter in python 3, created by Michael The and the forked version proposes for:

  • Apply simplification suggestion for David Eppstein.
  • Serialize and unserialize so we can do persistent storage.

Cuckoo filters serve as a drop-in replacement for Bloom filters. Just like Bloom filters, we can add items to the filter and then ask the filter whether that item is in the filter or not. Just like Bloom filters, there is a (very small and tunable) chance this will return a false positive. Unlike regular Bloom filters, Cuckoo filters also support deletion of items. Cuckoo filters use less space than Bloom filters for low false positive rates.

Installation

>>> pip install python-cuckoo

Usage

>>> import cuckoofilter
>>> cf = cuckoofilter.CuckooFilter(capacity=100000, fingerprint_size=1)

>>> cf.insert('Bin Fan')
66349

>>> cf.contains('Bin Fan')
True

>>> cf.contains('Michael The')
False

>>> len(cf.serialize().read())
100041

>>> cuckoofilter.CuckooFilter.unserialize(cf.serialize()).contains('Bin Fan')
True

>>> cf.delete('Bin Fan')
True

>>> cf.contains('Bin Fan')
False

Why use Cuckoo filters

The short answer: if you don't know whether you should use a Bloom filter or a Cuckoo filter, use a Cuckoo filter. For a well-written and visual explanation, check out Probabilistic Filters By Example.

Cuckoo filters are very similar to Bloom filters. They are both great at reducing a disk queries. Some example usages of Bloom filters:

  • Checking whether a URL is malicious or not (Google Chrome)
  • Deciding whether an item should be cached or not (Akamai)
  • Keep track of what articles a user has already read (Medium)

Why not use Cuckoo filters

As a Cuckoo filter is filled up, insertion will become slower as more items need to be "kicked" around. If your application is sensitive to timing on insertion, choose a different data structure.

Cuckoo filters might reject insertions. This occurs when the filter is about to reach full capacity or a fingerprint is inserted more than 2b times, where b is the bucket size. This limitation is also present in Counting Bloom filters. If this limitation is unacceptable for your application, use a different data structure.

Testing & profiling

Python-cuckoo comes with a test suite (cuckoofilter/tests/). We recommend using py.test to run unit tests.

pip install pytest
pytest cuckoofilter/

To generate a test coverage report, install the pytest coverage plugin and generate an html report.

pip install pytest-cov
pytest --cov-report html cuckoofilter/

The report will be created in a folder called htmlcov. Open htmlcov/index.html to inspect what parts of the library are tested.

To find out what parts of the library are slow, we need to profile our library. To do this, we can use cProfile by running python -m cProfile example.py. For a visualization of this profiling information, we can use snakeviz.

pip install snakeviz
python -m cProfile -o out.profile example.py
snakeviz out.profile

Original paper

Cuckoo filters were first described in:

Fan, B., Andersen, D. G., Kaminsky, M., & Mitzenmacher, M. D. (2014, December).
Cuckoo filter: Practically better than bloom.
In Proceedings of the 10th ACM International on Conference on emerging Networking Experiments and Technologies (pp. 75-88). ACM.

Their reference implementation in C++ can be found on github.

See also