pLSH-HDC: A Python repository from Lynsay

pLSH-HDC : Parallel Locality-Sensitive Hashing based High Dimensional Clustering

Based on Rajamaran, "Mining of Massive Datasets" - Section 3.4

A parallel implementation of LSH for High Dimensional Clustering.

Documents or sets are represented by a MinHash signature.
LSH is used to map similar signatures to similar bins.
Items which map to the same bin are considered candidate pairs for clustering.
A constraint function (currently Levenshtein distance) is applied to candidate pairs.
Items which satisfy constraint function are clustered via UnionFind.

Signatures can be pre-computed (in parallel) and stored using the MinHasher. Clusters should be built from MinHash signatures. Constraint checking currently uses the Levenshtein distance of the actual documents stored in a LevelDB database via a JSON interface.

Summary of changes:

Updated to use murmur3 hashing (for signatures and LSH)
Unicode support
De-coupled clustering from signature creation to allow parallel and pre-computation of signatures
Ability to dump/load signer state to disk
Constraint function checking for candidate pairs
Native parallel processing for constraint checks
Methods to help serialize cluster state to disk

Requires the C/C++ based pyhash, python-Levenshtein, and leveldb libraries. These can be installed via pip:

pip install pyhash python-Levenshtein leveldb

TODO: Remove LevelDB dependency, improve generality of constraint checking, update tests.

Lynsay/pLSH-HDC

pLSH-HDC : Parallel Locality-Sensitive Hashing based High Dimensional Clustering