seomoz/simhash-py

Explanation in README.md is somewhat hard to follow.

Opened this issue · 12 comments

How can I actually use this library to detect duplicate documents?
The insert function only accepts integers, so I suppose I should implement minHash myself (is that true?) and then use your simhash library to detect near-duplicates.
How can I connect this to the simhash-db you provided? Is that something users have to implement themselves?
Sorry for being a noob in the near-duplicate detection business 😄 and thank you for sharing your work with us 👍

+1, same question!

rth commented

See this blog post https://moz.com/devblog/near-duplicate-detection/ for a more general introduction and feel free to make a PR to improve the readme.

Here is an example of an estimator that uses simhash-py for near-duplicate detection; it takes as input a sparse CSR matrix created by scikit-learn's CountVectorizer or HashingVectorizer. I'm not saying there aren't better ways to use it, though.
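As a rough sketch of that idea (not the linked estimator's actual code; the helper name row_simhash is my own, and it assumes simhash-py's compute and unsigned_hash behave as shown):

from sklearn.feature_extraction.text import HashingVectorizer
from simhash import compute, unsigned_hash

documents = ['the quick brown fox', 'the quick brown fox jumps']
X = HashingVectorizer(n_features=2 ** 16).fit_transform(documents)  # sparse CSR matrix

def row_simhash(row):
    # Hypothetical helper: re-hash each nonzero feature index so the
    # values passed to compute() spread over the full 64-bit space.
    return compute([unsigned_hash(str(i).encode('utf-8')) for i in row.indices])

simhashes = [row_simhash(X.getrow(i)) for i in range(X.shape[0])]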

I think the most confusing aspect of how simhashing is talked about is the fact that there are two aspects to it, but the term 'simhash' is often used to refer to both: 1) a function to generate a simhash from input, and 2) methods for identifying near-duplicates from a set of simhashes.

To the first aspect, the input to the simhash hash function is a vector of hashes representative of the entity. There aren't really constraints on the size of this vector, but how it is generated has the most impact on the quality of the near-duplicates found. The Moz blog post that @rth referenced describes a number of ways in which that vector might be generated, but hashing a rolling window of words is a pretty reasonable way.
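To make "rolling window of words" concrete, here is a small pure-Python illustration (my own sketch, not library code); each window becomes one entry in the vector of hashes:

words = 'one fish two fish red fish blue fish'.split()
window = 3
shingles = [' '.join(words[i:i + window]) for i in range(len(words) - window + 1)]
# ['one fish two', 'fish two fish', 'two fish red', ...]
# Each shingle would then be hashed to a 64-bit int to build the vector.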

To the second aspect, simhash provides a way to identify all the pairs of simhashes that are 'close enough' to be near-duplicates. This was the most fun and interesting part to code up, but it is agnostic to how simhashes were generated, so it doesn't have any impact on the quality of the matches found.
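For a concrete picture of that second aspect, here is a hedged sketch, assuming simhash-py's find_all(hashes, number_of_blocks, different_bits) interface; the hash values are placeholders:

import simhash

# 64-bit simhashes, however they were generated (placeholder values here).
hashes = [0xABCD000000000000, 0xABCD000000000001, 0x1234567812345678]

# Find all pairs differing in at most 3 bits, using 6 permutation blocks;
# the number of blocks must exceed the number of differing bits tolerated.
matches = simhash.find_all(hashes, 6, 3)
print(matches)  # pairs of near-duplicate simhashes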

It is also important to note that when running the near-duplicate check, all the input simhashes must have been generated with the same method to have meaningful results. For example, if you have two simhashes, one generated with a bag of words and the other generated with shingles, they are not compatible.

I will try very hard to find some time to expand the README for this repo, since I, too, find it to be less than useful.

@dlecocq : I think the first aspect you described is about generating a minhash of the document (not a simhash, as you mentioned). The second aspect is about taking the minhashes of the documents and predicting whether they are similar or not. "Simhash" is the total process that should do both the first and the second aspect you talked about.

Given that this library is named "simhash", I expected to feed it a list of documents and obtain their pairwise similarity as the output. But it seems that this library does only the second aspect and leaves the first part for the user to do on their own.

BTW, the blog post link mentioned by @rth is dead. Does anyone have an updated link?

rth commented

Yes, the original link to the seomoz blog now returns ERR_SPDY_PROTOCOL_ERROR for me as well.

Here is the cached url.

@bikashg - making an effective simhash for a document depends greatly on the types of documents used and what aspect of the documents is meant to be matched, so making a method that works well in all situations is not practical. With that said, there is a top-level shingle helper that can help quite a bit:

import re
from simhash import compute, shingle, unsigned_hash

# If document is a string of words, this provides a VERY basic tokenization
# and subsequent simhash. unsigned_hash expects bytes, so the joined
# shingles are encoded before hashing.
tokens = re.compile(r'\W+').split(document)
shingles = [' '.join(s) for s in shingle(tokens)]
compute([unsigned_hash(s.encode('utf-8')) for s in shingles])
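And to compare two documents directly, a hedged follow-up sketch (simhash_of is a hypothetical wrapper around the snippet above, and it assumes the library's num_differing_bits helper):

import re
from simhash import compute, num_differing_bits, shingle, unsigned_hash

def simhash_of(document):
    # Hypothetical wrapper around the snippet above.
    tokens = re.compile(r'\W+').split(document)
    shingles = [' '.join(s) for s in shingle(tokens)]
    return compute([unsigned_hash(s.encode('utf-8')) for s in shingles])

a = simhash_of('the quick brown fox jumped over the lazy dog')
b = simhash_of('the quick brown fox jumped over the lazy cat')
print(num_differing_bits(a, b))  # a small distance suggests near-duplicates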

With regard to 'only' performing the last simhash bit, this library also provides the ability to find all the matches once the simhashes are generated.

@dlecocq : Thanks for the explanation. Yes, and I also went through the test cases in test.py, which further helped me understand the whole pipeline.
My use of the word 'only' was probably harsh; sorry for that, I didn't mean it that way. Cheers for putting in the effort to build this library.

The link to https://moz.com/devblog/near-duplicate-detection/ still works for me as of this morning. Maybe it was a transient failure?

rth commented

@b4hand For me it also works with wget and Firefox 57, but not Chrome 59 (see error above); probably some SPDY-related server misconfiguration.

As far as I know, we're not using SPDY. I've reported the issue to the team in charge of that page, it should get fixed. Thanks for letting us know about the issue.

@bikashg - no worries - maybe we could have a 'quickstart' section to the README? I'd welcome a PR, as it's sometimes hard to make good onboarding documentation after having worked on all the internals of a library.

@b4hand - https://moz.com/devblog/near-duplicate-detection/ does not work for me either, using Chrome 66.0.3359.117 on macOS 10.13.4.

FYI, the devblog link should be working again for Chrome.