Sign bit and hash legnth
anatolyra opened this issue ยท 6 comments
Hi,
I'm trying to understand the significance of this particular line. It seems that the final result, represented by BigInteger, is initialized by BigInteger.ONE just for the sign of the final value.
We're trying to generate a 64-bit hash value and then store it as a long. Unfortunately, the final bit size of this -> 'new PerceptiveHash(64)', is 65 bits.
Ignoring the sign bit, by calling 'getHashValue().longValue()' produces different accuracy results for the same two image.
Am I missing something? Or was this the intended behavior?
Thanks!
If you are having a BigInteger starting with just a 0 and shift it to the left you end up with a hash of 00 .
As 0's and 1's both hold information and to comply with consistent hash lengths I did not want to drop any data. "00010" and "0010" ( -> 10 ) are not the same hashes but would be considered equal if there is no leading 1.
Consider this in a normal base 10 number space.
1856 can also be represented as 0001856. In fact you can add arbitrarily many 0's to the beginning and it does not change value of the number but in our cases 0's are important!
You may also want to wait until later today. I am pushing release 2.0.0 with a rotational invariant hash, database examples, a few tweaks to speed up hash generation by a great bit and a little utility to compare the performance and accuracy of different hash algorithms. Sadly this will be a major version bump. Breaking change are introduced and hashes won't be compatible after version 2.0.0
getHashValue().longValue()'
Can you simply create a hash with a 63 bit hash? You can also drop the last bit if necessary you will loose a tiny bit of accuracy but get your desired length key. This is also what the new hash is doing.
Let me think about it a little bit.
I have taken a more in depth look at it and it seems like the sign bit is an artifact resulting from earlier versions in which I did not xor the hash but did it manually bit by bit.
The difference in the hash can be explained by the bitlength which is used to compute the normalized distance. If we extract this information and put it into an external field we should get away without the sign/padding bit.
return hammingDistance(h) / ((double)this.hashValue.bitLength()-1);
I'll need to run all unit tests to see if everything still works after the changes but it looks promising. Good catch
Sounds good! I'm waiting for version 2.0.0 then. Do you have an eta?
Thanks!
Sorry again that it took longer than expected. I added multiple new features which got a bit more tricky and real life happened. I'll hope to get everything done and published by tomorrow.
Thank you!