False positives
dholms opened this issue · 3 comments
I'm getting a much higher false positive rate than I would expect from a Bloom filter of the size that I'm using.
I'm using a 1024-bit Bloom filter with 16 hashes and 20 elements in each filter.
I'm running a test that adds 20 elements to a filter, checking before each add that the item isn't already reported as present.
After running the test 500 times, I see ~4 collisions.
Given a Bloom filter with those parameters, there should only be about a 1-in-1.3-billion chance of a collision (https://hur.st/bloomfilter/?n=20&p=&m=1024&k=16).
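A quick back-of-the-envelope check with the standard approximation p ≈ (1 − e^(−kn/m))^k gives the same ballpark:

// Standard Bloom filter false positive approximation: p ≈ (1 - e^(-k*n/m))^k
// m = 1024 bits, k = 16 hashes, n = 20 items already inserted.
const m = 1024, k = 16, n = 20
const p = Math.pow(1 - Math.exp(-k * n / m), k)
console.log(p) // ≈ 7.2e-10, i.e. roughly 1 in 1.3 billion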
Here's the short script:
// Assuming the BloomFilter constructor comes from the bloomfilter package;
// adjust the require path to wherever the library lives in your setup.
const { BloomFilter } = require('bloomfilter')

let collisions = 0
for (let i = 0; i < 500; i++) {
  // Fresh 1024-bit filter with 16 hash functions for each run.
  const filter = new BloomFilter(1024, 16)
  // Tracks what was actually added, so false positives can be told apart from real hits.
  const dict = {}
  for (let j = 0; j < 20; j++) {
    const str = Math.floor(Math.random() * 1000000000).toString(16)
    // filter.test() returning true for a string that was never added is a false positive.
    if (filter.test(str) && dict[str] !== true) {
      console.log("COLLISION: ", str)
      collisions++
    }
    filter.add(str)
    dict[str] = true
  }
  console.log(i)
}
console.log('done: ', collisions)
Interesting. The only way I was able to fix this was by using k independent hash functions, instead of the current version, which uses two independent hash functions a and b internally and combines them to generate k hash functions using:

h_i(d) = a(d) + i * b(d)
As far as I know this is common practice, and the idea is mentioned in Building a Better Bloom Filter. I'm guessing that this optimisation has some effect on the false positive rate as the k hash functions are not completely independent.
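To make the difference concrete, here is a rough sketch of the two index-derivation strategies; hashA, hashB and hashes are hypothetical placeholders, not the library's actual internals:

// Double-hashing scheme: derive all k bit indices from just two hash values.
// (hashA/hashB are stand-ins for whatever hashes the implementation really uses.)
function doubleHashIndices(hashA, hashB, k, m, d) {
  const a = hashA(d)
  const b = hashB(d)
  const indices = []
  for (let i = 0; i < k; i++) {
    indices.push((a + i * b) % m) // h_i(d) = a(d) + i * b(d)
  }
  return indices
}

// Fully independent scheme: one separate hash function per bit index.
function independentIndices(hashes, m, d) {
  return hashes.map(h => h(d) % m)
}

The combined scheme only has to evaluate two hashes per element, which is part of why it's such common practice, at the cost of the k derived indices being correlated.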
Perhaps @Gopiandcode would be interested in this. He is the author of Bloom filters debunked: Dispelling 30 Years of bad math with Coq!
I'd suggest you rerun the benchmark using a CSPRNG from crypto instead of Math.random(). With such complex bit patterns, the randomness provided by simple RNGs is often not enough.
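For example, assuming Node's built-in crypto module, the test string in the script above could be generated like this instead:

const crypto = require('crypto')

// Drop-in replacement for Math.floor(Math.random() * 1000000000).toString(16):
// four bytes from the OS-backed CSPRNG, rendered as an 8-character hex string.
// (The value range differs slightly from the original, which doesn't matter for this test.)
const str = crypto.randomBytes(4).toString('hex')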