This is python 3 library to support measuring the similarity of pieces of text based on their MinHash signature generated from their k-shingle form.
Text can be represented in MinHash form by creating a new ShingledText
instance and passing in text as well as optional values for the random_seed
for hashing (default 5), the shingle_length
aka the k in k-shingles (default 5), and the minhash_size
for the size of the MinHash signature (default 200). Variables for the list form of the minhash
and iterator representation of shingles
are available for the object. A similarity
function is also available to compute the Jaccard similarity of the two MinHash objects.
This library utilizes Python 3, NLTK, and Murmur Hash