reymond-group/drfp

numpy.int32?

rmrmg opened this issue · 3 comments

rmrmg commented

In the hash function in drfp/fingerprint.py we have:

hash_values.append(int(blake2b(t, digest_size=4).hexdigest(), 16))

which produces values in the range [0, 4G]. A NumPy array is then created from that list:

np.array(hash_values, dtype=np.int32)

but np.int32 only covers the range [-2G, 2G].

On Linux the values are automatically wrapped into the [-2G, 2G] range, but on Windows the conversion fails with an overflow error.

Is the [-2G, 2G] range correct and expected? That is, can I change the first line to:
hash_values.append(int(blake2b(t, digest_size=4).hexdigest(), 16) - 2_147_483_647)
or should I change the array dtype to uint32:
np.array(hash_values, dtype=np.uint32)
Which of the two should I do?
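
A third option, sketched here, is to build the array as uint32 and then reinterpret the bits as int32. This should give the same wrapped values that the issue reports on Linux, without relying on platform-specific conversion. The helper name `hash_tokens` and the sample inputs are hypothetical and not part of drfp; this is a minimal sketch, not the library's fix.

```python
from hashlib import blake2b

import numpy as np


def hash_tokens(tokens):
    """Hash byte strings to 32-bit values and return them as int32.

    Hypothetical helper for illustration; not drfp's actual function.
    """
    # Each 4-byte digest is an unsigned integer in [0, 2**32 - 1].
    hash_values = [int(blake2b(t, digest_size=4).hexdigest(), 16) for t in tokens]
    # uint32 can hold every possible value, so this step cannot overflow.
    unsigned = np.array(hash_values, dtype=np.uint32)
    # Reinterpret the same 32 bits as signed: values >= 2**31 become negative
    # (two's-complement wraparound), smaller values are unchanged.
    return unsigned.view(np.int32)


# Made-up inputs, only to exercise the helper.
signed = hash_tokens([b"CCO", b"c1ccccc1", b"[OH-]"])
```

Note that subtracting 2_147_483_647 would not reproduce this wraparound: it shifts every value instead of wrapping only those at or above 2**31, and the largest possible hash, 2**32 - 1, would map to 2**31, which is still one above the int32 maximum.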

Hello rmrmg,
I had the same problem: it fails with an overflow error on Windows. Have you solved it?

I got the same error:
OverflowError: Python int too large to convert to C long

I am still getting the same issue. The unit tests fail on my machine (Windows 10, Python 3.7), and it looks like the hash values returned by blake2b differ from what the original developer got on their machine. In hash(), I tried changing:

return np.array(hash_values, dtype=np.int32)
to
return np.array(hash_values, dtype=np.int64)

which fixed the error, but the unit tests still fail, so the encoding is clearly different from what they originally got, which makes it pretty unreliable. I tried using the encodings for ML and got terrible results, so it is hard to tell whether this is due to the encoding or to the description not being suitable for my system.
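
One possible explanation for the remaining test failures, not confirmed in this thread, is the dtype change itself rather than blake2b. The digest is deterministic for a given input, but an int64 array keeps hashes at or above 2**31 as large positive numbers, whereas the wrapped int32 values described at the top of the issue are negative. The sketch below uses made-up inputs to show where the two representations disagree.

```python
from hashlib import blake2b

import numpy as np

# Made-up inputs for illustration; any byte strings would do.
tokens = [b"CCO", b"c1ccccc1", b"[OH-]"]
hash_values = [int(blake2b(t, digest_size=4).hexdigest(), 16) for t in tokens]

# int64 keeps every hash as-is, in [0, 2**32 - 1].
as_int64 = np.array(hash_values, dtype=np.int64)
# Casting down keeps the low 32 bits, wrapping into the signed int32 range.
as_int32 = as_int64.astype(np.int32)

# The two arrays disagree exactly where the hash is >= 2**31.
differs = as_int64 != as_int32
```

If this is the cause, casting the int64 array back with `.astype(np.int32)` before returning it should restore the wrapped values, assuming the reference values in the tests were generated with that wraparound.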