seomoz/simhash-py

find_all() returns wrong results

Opened this issue · 4 comments

import simhash

corpus = simhash.Corpus(6, 3)
corpus.insert(1)
corpus.insert(2)
corpus.find_all(2)

Running this code returns:
[2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L]

This is actually the current intended behavior. What happens in find_all is that it walks through all the tables for the block/bit-difference configuration (there are blocks C bit-difference tables), and for each table finds all matches within the bit-difference.

For the configuration (6, 3), there are 6 C 3 = 20 tables, which means that we actually expect to find identical matches exactly 20 times:

corpus = simhash.Corpus(6, 3)
corpus.insert(2)
# This comes out to be 20
len(corpus.find_all(2))

In the original example you provided, we find exactly 20 entries for 2, and another 10 for 1. Since they differ by exactly one bit, only one of the six blocks will contain the differing bit, so most tables will turn up a map for that query.

Does that help to make sense of it?

Thank you for the explanation. I understand why it happens, but why is this the intended behavior?

I guess when I say it's the current intended behavior, I mean that it is working as expected :-)

I would welcome a PR to make it return a set

It appears that simhash.Corpus has been removed, so I'm guessing this issue is a "Won't Fix".