seomoz/simhash-py

Choice of the hashing function for shingles

rth opened this issue · 4 comments

rth commented

simhash-py currently uses Python's built-in hash function to hash shingles. While this works well on Python 2.7, it is not ideal when extending this implementation to later Python versions, because:

  • As of Python 3.3, the internal hash function is salted with a random value by default:

    Note: By default, the hash() values of str, bytes and datetime objects are “salted” with an unpredictable random value. Although they remain constant within an individual Python process, they are not predictable between repeated invocations of Python.

    This is intended to provide protection against a denial-of-service caused by carefully-chosen inputs that exploit the worst case performance of a dict insertion, O(n^2) complexity. See http://www.ocert.org/advisories/ocert-2011-003.html for details.
    Changing hash values affects the iteration order of dicts, sets and other mappings. Python has never made guarantees about this ordering (and it typically varies between 32-bit and 64-bit builds).

    See also PYTHONHASHSEED.

    This can, however, be disabled by setting the environment variable PYTHONHASHSEED=0.

  • There is actually no guarantee on the implementation of the built-in hash function, and it appears to have changed in Python 3.4:

    • Python 2.7:

      >>> hash('test')
      2314058222102390712
      
    • Python 3.3 with PYTHONHASHSEED=0,

      >>> hash('test')
      2314058222102390712
      
    • Python 3.4 with PYTHONHASHSEED=0,

      >>> hash('test')
      4418353137104490830
      

      This leads to the TestFunctional.test_added_text test failing on Python 3.4-3.5 in PR #27, as the number of differing bytes found changes from 3 to 5 with this different hashing mechanism.

As far as I understand, the actual algorithm used to hash shingles is not critical as long as it is consistent, is it? How many bits should the hash be: 64? Is it possible to use fewer? For instance, could one use the 32-bit version of MurmurHash3 (which is used to hash character or word n-grams in scikit-learn and Spark)?
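For illustration, something along these lines would give shingle hashes that are stable across Python versions (MD5 is used here only as a stand-in for whatever hash ends up being chosen, and stable_hash64 is just a name for this sketch):

    import hashlib
    import struct

    def stable_hash64(shingle):
        # Interpret the first 8 bytes of an MD5 digest as an unsigned 64-bit
        # integer.  Any consistent 64-bit hash (e.g. a MurmurHash variant)
        # would do; the point is only that the value does not depend on the
        # interpreter version or PYTHONHASHSEED.
        digest = hashlib.md5(shingle.encode('utf-8')).digest()
        return struct.unpack('>Q', digest[:8])[0]

    hashes = [stable_hash64(s) for s in ('the quick brown', 'quick brown fox')]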

Edit: a subsidiary question: assuming that we have 32-bit hashed shingles, what would be the best values to provide to the 64-bit simhash?

  • 32 bits of 0 followed by the 32 bits of the hash
  • the 32-bit hash repeated twice
  • some function of this 32-bit hash returning 64 bits (see the sketch after this list)?
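To make the third option concrete, this is the kind of thing I had in mind (just the splitmix64 finalizer used as a bit mixer, under a made-up name; of course the result still carries only 32 bits of information however well it is spread out):

    def expand_32_to_64(h32):
        # splitmix64 finalizer: spreads a small integer over all 64 bits.
        # The constants and shifts are the standard splitmix64 ones.
        x = (h32 + 0x9E3779B97F4A7C15) & 0xFFFFFFFFFFFFFFFF
        x = ((x ^ (x >> 30)) * 0xBF58476D1CE4E5B9) & 0xFFFFFFFFFFFFFFFF
        x = ((x ^ (x >> 27)) * 0x94D049BB133111EB) & 0xFFFFFFFFFFFFFFFF
        return x ^ (x >> 31)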

b4hand commented

I agree that it makes sense not to use the built-in Python hash function, so that simhash-py can be independent of the Python version. I'm not sure how to answer your subsidiary question as it is a little unclear, but the first two options both sound like a bad idea. Also, I think we're fairly consistent about using 64-bit hashes everywhere, so it would be best to use the MurmurHash64A variant, which outputs a 64-bit hash natively. It's trivial to reduce a 64-bit hash to a 32-bit hash, but it's impossible to regain the information lost by starting with a 32-bit hash and trying to produce a 64-bit hash.
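For what it's worth, the reduction direction is just a mask or an xor-fold, e.g. (sketch only):

    def reduce_64_to_32(h64):
        # Fold the high half into the low half; plain masking
        # (h64 & 0xFFFFFFFF) works too.  The reverse direction cannot
        # recover the discarded bits.
        return (h64 ^ (h64 >> 32)) & 0xFFFFFFFF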

rth commented

@b4hand Thanks for your response! Sorry for the confusion; for simhash-py it definitely makes sense to use a hashing function producing 64-bit hashes: MurmurHash2 64A, or maybe the 128-bit MurmurHash3: apparently there is no 64-bit version of v3 because the 128-bit one is about as fast and can be truncated. In the latter case, here are some Cython wrappers for the C++ code of v3 from scikit-learn (under the BSD 3-clause license) which could be adapted for 128-bit hashes truncated to 64 bits.
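For example, assuming the mmh3 package is available (used here only to illustrate the truncation; wrapping the C++ code directly would avoid the extra dependency), it could look like:

    import mmh3  # MurmurHash3 bindings, assumed installed separately

    def shingle_hash64(shingle):
        # 128-bit MurmurHash3, keeping only the low 64 bits.
        return mmh3.hash128(shingle) & 0xFFFFFFFFFFFFFFFF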

My subsidiary question was a bit off topic:
I'm trying to use simhash-py with an external tokenizer / hashing of shingles (~word n-grams) to process non-web text documents. In particular, I am considering using scikit-learn's HashingVectorizer as that would allow some additional pre-processing / filtering of n-grams. The difficulty is that there, a 32-bit hash value is computed and then a modulo n_features is applied (to reduce the size of the resulting sparse arrays), so that the final hash value lies in [0, n_features ~ 2^20]. This produces some collisions, but it is considered an acceptable trade-off for text classification (and I imagine it might not be too critical for duplicate detection either). Even if we can't make a true 64-bit hash in this case, I was wondering how to best map this hash value from [0, n_features] to [0, 2^64-1] (so it can be used in simhash-py), and I agree that my suggestions above were not great. Actually, in the original simhash paper they also do some rescaling of the hashes:

[..] it would be desirable for our hash function to assign very different values to any two files with different extensions. To this end, we compute a simple hash of the (first three characters of) a file's extension, with a value between 0 and 1. The distribution of these values should be fairly uniform across the space of possible extensions.
We then multiply this extension value by MAXINT, and add that to our previous key value. Since we only care about the distance between keys, and not their actual values, this will not affect the relation between files with the same extension, while it will tend to widely separate files of different extensions. This has the effect of more evenly distributing our key values across the space of 32-bit integers, and making cross-extension key matches very unlikely.

so I'll try to just multiply my hash values by (2^64-1)/n_features to make them more evenly distributed in the 64-bit space and see if that works. Thanks for the feedback!
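Concretely, the pipeline I have in mind is roughly the following (murmurhash3_32 is scikit-learn's public hashing helper; the modulo step only mimics what HashingVectorizer does internally, ignoring details such as sign handling, and the final scaling is the work-around described above):

    from sklearn.utils import murmurhash3_32

    N_FEATURES = 2 ** 20
    SCALE = (2 ** 64 - 1) // N_FEATURES

    def shingle_to_64bit(shingle):
        # 32-bit MurmurHash3 reduced modulo n_features, as in the hashing trick.
        h = murmurhash3_32(shingle, positive=True) % N_FEATURES
        # Stretch [0, N_FEATURES) over the 64-bit range by simple scaling.
        # This only spreads the values out; individual bits are still far
        # from uniformly distributed, so it remains a rough work-around.
        return h * SCALE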

gosom commented

As stated above, and as I understand from the simhash paper, the choice of hash function is not crucial.

I tried using:

  • farmhash (64-bit)
  • xxhash (64-bit)

and I am getting much larger Hamming distances between shingles that are "close" than I do when using the built-in hash function.

I do not think this is a quirk of the specific shingles, since I tested on a lot of shingles.

Have you noticed this behavior?

b4hand commented

The particular choice of hash function shouldn't matter much to the algorithm itself.

I personally haven't investigated any alternative hashing functions since the code is working for us. Can you quantify how much closer the distances are with the built-in hash function versus one of these other hash functions?
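As a rough sketch of what I mean by quantifying it (not what simhash-py itself does; md5_hash64 is only a stand-in for farmhash or xxhash), one could compare the Hamming distances of simhashes built from each hash function on known near-duplicates:

    import hashlib
    import struct

    def simhash64(hashes):
        # Standard simhash bit-voting over 64-bit feature hashes.
        counts = [0] * 64
        for h in hashes:
            for i in range(64):
                counts[i] += 1 if (h >> i) & 1 else -1
        return sum(1 << i for i in range(64) if counts[i] > 0)

    def hamming(a, b):
        return bin(a ^ b).count('1')

    def builtin_hash64(s):
        # Python's built-in hash, masked to 64 bits (version-dependent!).
        return hash(s) & 0xFFFFFFFFFFFFFFFF

    def md5_hash64(s):
        # Stand-in for the candidate hash function under test.
        return struct.unpack('>Q', hashlib.md5(s.encode('utf-8')).digest()[:8])[0]

    def shingles(text, window=4):
        tokens = text.split()
        return [' '.join(tokens[i:i + window])
                for i in range(max(1, len(tokens) - window + 1))]

    a = "the quick brown fox jumps over the lazy dog"
    b = "the quick brown fox leaps over the lazy dog"
    for hf in (builtin_hash64, md5_hash64):
        d = hamming(simhash64([hf(s) for s in shingles(a)]),
                    simhash64([hf(s) for s in shingles(b)]))
        print(hf.__name__, d)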

Regardless, for our use case, we do store these hash values over time, so we can't just arbitrarily swap out the hashing function in our implementation. That's why I said I would need to baseline the new hashing function to ensure it produced similar results, and also we would probably not switch over to the new hashing function immediately.

I haven't had a chance to do any additional investigation on this since it isn't a high priority, and we're not currently running into any issues. Ultimately, I would like to see the dependence on the Python version removed, but it's not pressing either.