/hyperminhash

LogLog space version of MinHash by combining ideas from HyperLogLog and b-bit MinHash

Primary LanguagePythonCreative Commons Zero v1.0 UniversalCC0-1.0

hyperminhash

LogLog scaling version of MinHash by combining ideas from HyperLogLog and b-bit MinHash

Please cite: Yun William Yu & Griffin Weber. HyperMinHash: MinHash in LogLog space. (2017)
Preprint: https://arxiv.org/abs/1710.08436

Code consists of a class hyperminhash.HyperMinHash that allows:

  • __init__: specify the size and parameters of the sketch
  • update: add hashable items to sketch
  • count: estimate the count-distinct cardinality of the sketch
  • filled_buckets: return the number of buckets that have an item (mostly used for internal algorithms)
  • __add__: given two sketches, A and B, A+B merges the sketches such that every item in A or in B is now in the sum.
  • jaccard: given two sketches, A and B, returns the Jaccard index
  • serialize: returns a ByteString that can be deserialized into the original object
  • deserialize: initializes a HyperMinHash sketch based on a serialized ByteString
  • intersection: returns the intersection cardinality, Jaccard index, number of bucket matches, and union cardinality when combining two sketches.

Full details are in the Python Docstrings. Of note, the Python implementation is not fully space-optimal, as that would require bit packing. Instead, we use size uint8, uint16, uint32, uint64 types from numpy in the implementation. Note also that we depend on Python3.

To test the class, we also provide an experiments/ directory.

  • tests_full.py will regenerate data allowing recreation of Figure 6 in the paper, though it may take weeks on standard workstations. Note that you can edit the "test_reps" parameter that is passed to the hmh_test_range function within tests_full.py to a smaller number to either increase speed/decrease accuracy by running fewer repetitions, or decrease speed/increase accuracy by running additional repetitions.
  • error_plot_full.py assumes that the current directory has the output of tests_full.py, and will generate a nice matplotlib graph.

We also provide precomputed data of the type generated by tests_*.py. To use these, go to experiments_precomputed/ and run bash regen.sh.

Unit tests are in hyperminhash_tests.py, using the unittest framework.


Thanks to Seif Lotfy for providing a Golang implementation of HyperMinHash: https://github.com/axiomhq/hyperminhash