seomoz/simhash-py

output of simhash.compute method

bikashg opened this issue · 3 comments

I printed the output of the simhash.compute() method, both its type and its value. I noticed that the type is integer and the value is a 19-digit number (e.g. 8550830854347186281). Shouldn't it be a 64-digit fingerprint consisting of only 0s and 1s?

Yep. It's just the integer representation of the fingerprint:

>>> bin(8550830854347186281)
'0b111011010101010101001110101101110010111110100000110000001101001'
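Note that `bin()` omits leading zero bits, so the string above is shorter than 64 characters. A minimal sketch of how to see the full 64-digit form, using only Python built-ins (nothing here is part of the simhash API):

```python
fingerprint = 8550830854347186281  # the value from the question above

# bin() drops leading zeros; format() with '064b' pads to the full 64-bit width.
bits = format(fingerprint, '064b')

print(bits)
print(len(bits))                    # 64
print(int(bits, 2) == fingerprint)  # True: same number, different base
```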

Thanks for the reply. So the program internally uses the binary stream (for matching) but displays the integer for printing purposes? Also, please help me understand the association between the 64-bit binary value and the 19-digit integer.

Internally, the fingerprints are stored as a uint64_t, an unsigned 64-bit integer. These integers are compared to one another when identifying near-duplicates, by counting the number of bits in which they differ. The ~19-digit integer is just the base-10 representation of the same fingerprint. A 64-bit value can be as large as 2^64 - 1 = 18446744073709551615, so the decimal form never needs more than 20 digits.
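To illustrate the comparison described above, here is a rough pure-Python sketch of counting differing bits between two fingerprints. The helper name `hamming_distance` and the second fingerprint `b` are made up for illustration; this is not the library's own implementation, which works on uint64_t values natively.

```python
def hamming_distance(a: int, b: int) -> int:
    """Count the bit positions in which two 64-bit fingerprints differ."""
    mask = (1 << 64) - 1  # keep the result within 64 bits
    # XOR leaves a 1 wherever the two values disagree; count those 1s.
    return bin((a ^ b) & mask).count("1")

a = 8550830854347186281  # the fingerprint from the question
b = a ^ 0b101            # hypothetical second fingerprint: bits 0 and 2 flipped

print(hamming_distance(a, b))  # 2
print(hamming_distance(a, a))  # 0
```

Two documents would be treated as near-duplicates when this count falls below some chosen threshold.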