find_all() returns wrong results
Opened this issue · 4 comments
import simhash
corpus = simhash.Corpus(6, 3)
corpus.insert(1)
corpus.insert(2)
corpus.find_all(2)
Running this code returns:
[2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L]
This is actually the current intended behavior. What happens in find_all
is that it walks through all the tables for the block/bit-difference configuration (there are blocks C bit-difference
tables), and for each table finds all matches within the bit-difference.
For the configuration (6, 3)
, there are 6 C 3 = 20
tables, which means that we actually expect to find identical matches exactly 20 times:
corpus = simhash.Corpus(6, 3)
corpus.insert(2)
# This comes out to be 20
len(corpus.find_all(2))
In the original example you provided, we find exactly 20 entries for 2
, and another 10 for 1
. Since they differ by exactly one bit, only one of the six blocks will contain the differing bit, so most tables will turn up a map for that query.
Does that help to make sense of it?
Thank you for the explanation. I understand why it happens, but why is this the intended behavior?
I guess when I say it's the current intended behavior, I mean that it is working as expected :-)
I would welcome a PR to make it return a set
It appears that simhash.Corpus
has been removed, so I'm guessing this issue is a "Won't Fix".