False positives
dholms opened this issue · 3 comments
I'm getting a much higher false positive rate than I would expect from a Bloom filter of the size that I'm using.
I'm using a 1024-bit Bloom filter with 16 hashes and 20 elements in each filter.
I'm running a test that adds 20 elements to a filter, checking before each add that the item isn't already reported as present.
After running the test 500 times, I see ~4 collisions.
Given a Bloom filter with those parameters, there should only be about a 1-in-1.3-billion chance of a collision (https://hur.st/bloomfilter/?n=20&p=&m=1024&k=16).
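A quick back-of-the-envelope check with the standard approximation p ≈ (1 − e^(−kn/m))^k gives the same ballpark:

// Standard Bloom filter false positive approximation: p ≈ (1 - e^(-k*n/m))^k
// m = 1024 bits, k = 16 hashes, n = 20 items already inserted.
const m = 1024, k = 16, n = 20
const p = Math.pow(1 - Math.exp(-k * n / m), k)
console.log(p) // ≈ 7.2e-10, i.e. roughly 1 in 1.3 billion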
Here's the short script:
// Assuming the BloomFilter constructor comes from the bloomfilter package;
// adjust the require path to wherever the library lives in your setup.
const { BloomFilter } = require('bloomfilter')

let collisions = 0
for (let i = 0; i < 500; i++) {
  // Fresh 1024-bit filter with 16 hash functions for each run.
  const filter = new BloomFilter(1024, 16)
  // Tracks what was actually added, so false positives can be told apart from real hits.
  const dict = {}
  for (let j = 0; j < 20; j++) {
    const str = Math.floor(Math.random() * 1000000000).toString(16)
    // filter.test() returning true for a string that was never added is a false positive.
    if (filter.test(str) && dict[str] !== true) {
      console.log("COLLISION: ", str)
      collisions++
    }
    filter.add(str)
    dict[str] = true
  }
  console.log(i)
}
console.log('done: ', collisions)
Interesting. The only way I was able to fix this was by using k independent hash functions, instead of the current version, which uses two independent hash functions a and b internally and combines them to generate k hash functions using:

h_i(d) = a(d) + i * b(d)
As far as I know this is common practice, and the idea is mentioned in Building a Better Bloom Filter. I'm guessing that this optimisation has some effect on the false positive rate as the k hash functions are not completely independent.
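To make the difference concrete, here is a rough sketch of the two index-derivation strategies; hashA, hashB and hashes are hypothetical placeholders, not the library's actual internals:

// Double-hashing scheme: derive all k bit indices from just two hash values.
// (hashA/hashB are stand-ins for whatever hashes the implementation really uses.)
function doubleHashIndices(hashA, hashB, k, m, d) {
  const a = hashA(d)
  const b = hashB(d)
  const indices = []
  for (let i = 0; i < k; i++) {
    indices.push((a + i * b) % m) // h_i(d) = a(d) + i * b(d)
  }
  return indices
}

// Fully independent scheme: one separate hash function per bit index.
function independentIndices(hashes, m, d) {
  return hashes.map(h => h(d) % m)
}

The combined scheme only has to evaluate two hashes per element, which is part of why it's such common practice, at the cost of the k derived indices being correlated.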
Perhaps @Gopiandcode would be interested in this. He is the author of Bloom filters debunked: Dispelling 30 Years of bad math with Coq!
I'd suggest you rerun the benchmark using a CSPRNG from crypto instead of Math.random(). With such complex bit patterns, the randomness provided by simple RNGs is often not enough.
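For example, assuming Node's built-in crypto module, the test string in the script above could be generated like this instead:

const crypto = require('crypto')

// Drop-in replacement for Math.floor(Math.random() * 1000000000).toString(16):
// four bytes from the OS-backed CSPRNG, rendered as an 8-character hex string.
// (The value range differs slightly from the original, which doesn't matter for this test.)
const str = crypto.randomBytes(4).toString('hex')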