bits-and-blooms/bloom

Problem with uniformity of FNV implementation

turgon opened this issue · 7 comments

Hi Will,

Really appreciate your time implementing Bloom. We have opted to use it in a new project of ours at work, and while reviewing I have found a problem with your custom implementation of FNV. I threw together a test in a branch of my fork to help illustrate. You can find that here: https://github.com/turgon/bloom-1/blob/uniformity/uniformity_test.go

The arrangement of the test is pretty straightforward. The hashing method under test is used to determine the location of a bit to set in the filter, just like it is during your Add() method. I count up the number of times each bit is hit, and then compare the histogram to the uniform distribution using a Chi Square goodness of fit test.

I tested five different methods of finding bit locations:

  1. your current implementation
  2. an implementation using Murmur3
  3. your implementation preceding changeset ed0c3ad
  4. an implementation using Go's native FNV-1
  5. an implementation using Go's native FNV-1a

The results indicate that your two custom implementations of FNV are not uniformly distributing mod m. In practice one could size up the filter in m and k and not suffer terribly, however, it certainly will increase the false positive rate in very unpredictable ways.

I think the problem stems from your addition of the index on line 77.
That addition makes the initial value differ from FNV for indices i greater than zero. My suspicion is that fnv64Hash(0, data) is ok, but the others are not. Also, the others (1, 2, 3) have no influence on the bit location when ii is zero on line 99. I believe this causes the first bit location to be uniformly distributed, but not the rest.

After further research I introduced Go's native FNV-1 and FNV-1a methods as a baseline, and although they performed surprisingly well, there are only 32 and 64 bit implementations. Therefore the g(x) = h0(x) + i*h1(x) trick can only be applied for filters with m < 2^32 when splitting 64-bit FNV.

Cassandra uses Murmur3 instead of FNV for its bloom filters and the linked article hints at why. It satisfies the requirement of Kirsch and Mitzenmacher,
Our k hash functions will be of the form gi(x) = h1(x)+ih2(x) mod p, where h1(x) and h2(x) are two independent, uniform random hash functions on the universe with range {0, 1, 2, . . . , p − 1}, and throughout we assume that i ranges from 0 to k − 1. I.e.: Uniformity is critical or the math doesn't hold up.

I also prefer Murmur3 as it produces 128-bit values, which lends itself well to addressing 64-bit locations, while not suffering the kind of performance hit a cryptographic hash takes.

I hope this was clear enough, and you can find it useful. Thank you!

willf commented

Hi turgon, I will review this.

Hey @willf, any chance you've looked yet?

willf commented

@turgon thanks again for running this test. Would you be willing to adapt your code to swap out the current hash, and then include some test for uniformity as an acceptance test? I'm afraid I personally won't be able to get to it before a week or two or more.

@willf I'd be happy to.

@anacrolix looks like a nice bundle of stuff. I notice they don't have uniformity tests, too. ;)

willf commented

@turgon Wow, it's taken me a long time to get to this, and I see you went on to create your own Bloom filter code, but this code now uses the same murmurhash, and there is a test for uniformity included.

@willf Pay my bloom no mind. Just me goofing around. I'm excited you came back to this and again I really appreciate you making time to publish libraries like this one!