ChenghaoMou/text-dedup

refactor hash related code

Closed this issue · 3 comments

This code base utilizes several different kinds of hashes (md5, SHA-1, xxhash, sha256).
Some are implemented in redundant manners.
For instance, sha1_hash32 in minhash_spark and sha1_hash (with default d=32) in minhash.py
seem to achieve the same thing.
I think that refactoring it like the hashfunc.py in ekzhu/datasketch could be helpful.

if I prepared a PR, would you consider it?

Thanks for the PR. It is duplicated (somehow ironic considering the name) because it allows us to submit a Pyspark job without declaring additional Python files or dependencies. That also reminds me of another issue — rn the spark script imports the ngram function from the other file, and I need to change that.

But it is probably a good choice for other type of algorithms. So I will approve the PR.

Thanks again for your help!

If I understand you correctly, atleast for PySpark based minhash_spark.py, we cannot include any dependencies outside python standard library or pyspark functions right? (since each executor will also need to have it installed and coordinating that is possible but difficult).
Seems like there are some pyspark functions in this case.
xxhash64

Otherwise, I will continue another PR that
imports and uses the new consolidated hash functions and remove redundant ones, except for minhash_spark.py

For minhash_spark.py, I will

  1. implement blake2b_hash without any dependency except hashlib and struct
    I discovered that hashlib.blake2b is only marginally faster than sha1 for this purpose. also most of the time is spent in converting the raw byte output into truncated 32bit integers with python functions.
    I think i can change this by using hashlib.blake2b(data,digest_size=32) -> putting it into a raw np array -> calling numpy.frombuffer.
  2. investigate how to utilize xxhash64 for this purpose.

Most of my suggestions were implemented through #30 #35 #48
Some more are ongoing. There are more to be implemented but I think this issue can be closed