refactor hash related code
Closed this issue · 3 comments
This code base utilizes several different kinds of hashes (md5, SHA-1, xxhash, sha256).
Some are implemented in redundant manners.
For instance, sha1_hash32
in minhash_spark and sha1_hash
(with default d=32) in minhash.py
seem to achieve the same thing.
I think that refactoring it like the hashfunc.py in ekzhu/datasketch
could be helpful.
if I prepared a PR, would you consider it?
Thanks for the PR. It is duplicated (somehow ironic considering the name) because it allows us to submit a Pyspark job without declaring additional Python files or dependencies. That also reminds me of another issue — rn the spark script imports the ngram
function from the other file, and I need to change that.
But it is probably a good choice for other type of algorithms. So I will approve the PR.
Thanks again for your help!
If I understand you correctly, atleast for PySpark based minhash_spark.py
, we cannot include any dependencies outside python standard library or pyspark functions right? (since each executor will also need to have it installed and coordinating that is possible but difficult).
Seems like there are some pyspark functions in this case.
xxhash64
Otherwise, I will continue another PR that
imports and uses the new consolidated hash functions and remove redundant ones, except for minhash_spark.py
For minhash_spark.py
, I will
implementblake2b_hash
without any dependency except hashlib and struct
I discovered that hashlib.blake2b is only marginally faster than sha1 for this purpose. also most of the time is spent in converting the raw byte output into truncated 32bit integers with python functions.
I think i can change this by usinghashlib.blake2b(data,digest_size=32)
-> putting it into a raw np array -> callingnumpy.frombuffer
.- investigate how to utilize xxhash64 for this purpose.