barrust/pyprobables

Wrong result with large filter

mrqc opened this issue · 7 comments

mrqc commented

I expect that if I ask the filter to check for membership and it tells me FALSE, then the key is definitely NOT a member. I did the following:

from probables import BloomFilter

def verifyMembership(key):
    global bloom
    if key in bloom:
        print('Its possibly in')
    else:
        print('Definitly not in')

key = 'some'
filterFile = 'index.dat'
bloom = BloomFilter(est_elements=100000000, false_positive_rate=0.03, filepath=filterFile)
verifyMembership(key)
bloom.add(key)
verifyMembership(key)
bloom.export(filterFile)

I called my script twice and the output is:

Definitly not in
Its possibly in
Definitly not in
Its possibly in

But I would expect:

Definitly not in
Its possibly in
Its possibly in
Its possibly in

If I reduce est_elements to, say, 10000, then it works fine.
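
The expectation above (a negative answer is definitive, and a key that was added stays "possibly in" after the filter is exported and reloaded) can be illustrated with a toy Bloom filter. This is only a minimal sketch and not pyprobables' implementation: the class `ToyBloom`, its `save` and `load` methods, the salted SHA-256 hashing, and the helper `simulate_two_runs` are all hypothetical names chosen for illustration.

```python
import hashlib
import os
import tempfile


class ToyBloom:
    """Tiny illustrative Bloom filter (hypothetical; not pyprobables' code)."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key):
        # Derive num_hashes bit positions from salted SHA-256 digests.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, key):
        # True only if every bit for this key is set, so an added key
        # can never be reported as absent (no false negatives).
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(key)
        )

    def save(self, path):
        with open(path, "wb") as fh:
            fh.write(self.bits)

    @classmethod
    def load(cls, path, num_bits=1024, num_hashes=3):
        blm = cls(num_bits, num_hashes)
        with open(path, "rb") as fh:
            blm.bits = bytearray(fh.read())
        return blm


def simulate_two_runs(key="some"):
    """Mimic running the script twice; return the four membership answers."""
    answers = []
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "index.dat")
        first = ToyBloom()              # first run: no file yet
        answers.append(key in first)
        first.add(key)
        answers.append(key in first)
        first.save(path)
        second = ToyBloom.load(path)    # second run: reload from disk
        answers.append(key in second)
        second.add(key)
        answers.append(key in second)
    return answers


print(simulate_two_runs())  # [False, True, True, True]
```

If the third answer comes back False, the bits written during the first run were not restored on load, which is the symptom described in this report.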

Running your script with a few changes (mostly removing the global variable), I got the results you were expecting. The code I used:

from probables import BloomFilter

def verifyMembership(blm, key):
    if key in blm:
        print('Its possibly in')
    else:
        print('Definitly not in')

key = 'some'
filterFile = 'index.dat'
blm = BloomFilter(est_elements=100000000, false_positive_rate=0.03, filepath=filterFile)
verifyMembership(blm, key)
blm.add(key)
verifyMembership(blm, key)
blm.export(filterFile)

# test loading it
blm2 = BloomFilter(est_elements=100000000, false_positive_rate=0.03, filepath=filterFile)
verifyMembership(blm2, key)
blm2.add(key)
verifyMembership(blm2, key)
blm2.export(filterFile)

So far, I am unable to replicate.

It could be something that was fixed in version 0.4.1, which I haven't pushed yet. I will cut that release, and hopefully that will fix your issue. You would need to update your version of pyprobables.
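
To check which release is installed before and after upgrading, something like the following may help. It is a sketch that relies only on the standard library's `importlib.metadata` (Python 3.8+); the helper names `installed_version` and `is_at_least` are hypothetical, and the comparison assumes plain dotted numeric versions such as `0.4.1`.

```python
import re
from importlib import metadata


def installed_version(dist_name):
    """Return the installed version string of a distribution, or None."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def is_at_least(version, minimum):
    """Compare versions by their numeric parts (assumes no pre-release tags)."""
    def parse(v):
        return tuple(int(p) for p in re.findall(r"\d+", v))
    return parse(version) >= parse(minimum)


found = installed_version("pyprobables")
if found is None:
    print("pyprobables is not installed")
elif is_at_least(found, "0.4.1"):
    print(f"pyprobables {found} is 0.4.1 or newer")
else:
    print(f"pyprobables {found} predates 0.4.1; try: pip install --upgrade pyprobables")
```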

mrqc commented

Maybe that's the reason. Hopefully it's fixed in 0.4.1. But one question: if you run this (adapted) script twice:

from probables import BloomFilter

def verifyMembership(key, bloomFilter):
    if key in bloomFilter:
        print('Its possibly in')
    else:
        print('Definitly not in')

key = 'some'
filterFile = 'index.dat'
bloomFilter = BloomFilter(est_elements=100000000, false_positive_rate=0.03, filepath=filterFile)
verifyMembership(key, bloomFilter)
bloomFilter.add(key)
verifyMembership(key, bloomFilter)
bloomFilter.export(filterFile)

...you get the expected result, right? If so, then I am fine and looking forward to 0.4.1. Because for me, the run looks like this (which seems very strange to me):

$ rm index.dat 
$ python3 parse.py 
Definitly not in
Its possibly in
$ python3 parse.py 
Definitly not in
Its possibly in
$ python3 parse.py 
Its possibly in
Its possibly in
$

mrqc commented

Where parse.py is the code provided.

My version of the script performed both steps back to back, so it only had to run once. When I ran your version twice, I didn't see the issue either. Version 0.4.1 has been pushed and hopefully fixes what you are seeing.

As for your other reply, I am not sure I understand what you are referencing about parse.py.

mrqc commented

Thanks. Appreciate that! parse.py is simply the filename of my Python script. ;)

mrqc commented

Yea, that fixed it! Thanks!