[Question] Has bloom filters a higher probability of collide when strings are similar

Question

[Question] Has bloom filters a higher probability of collide when strings are similar

bacloud23 opened this issue 3 years ago · 5 comments

Because if it is the case, this really suits my need.

I'm trying to discover popular searches on a website. For this, I'm using TopK algorithm which is based on Bloom Filter hashing.

I don't want "Hello world" and "hello world" to be count twice. So if it is a collision, that would be really appropriate for my use case.

Thanks a lot @Callidon !!!

Answer 1 · 2021-12-12T13:15:20.000Z

if not, how can I replace the Hash function used with another one more permissive

Answer 2 · 2021-12-13T07:32:15.000Z

Hi 👋
Thank you for your interest in our library!

I don't want "Hello world" and "hello world" to be count twice. So if it is a collision, that would be really appropriate for my use case.

Under the hood, we are using xxHash. Their website contains a full benchmark, including collision probabilities. You should be able to find all information you need there.

if not, how can I replace the Hash function used with another one more permissive

You can change the seed used by the hash functions, but I'm afraid we do not have an API for swapping the hash function. The reason is that all the data-structures implemented by our package are carefully designed around xxHash. However, you can override any classes and their methods to customize how values are hashed, but we don't provide support for this, as it is a very advanced usage of the library.

Answer 3 · 2021-12-13T07:52:23.000Z

Hello @Callidon
Thank you so much for your reply.

Ok then I will not mess with the hash function itself. I will pre-run a similarity detection on my stream (which is a stream of strings).
Closing this.
Thanks!

Answer 4 · 2021-12-13T15:18:43.000Z

In newer version (soon I hope) you'll have the possibility to override a function for using any custom hash functions as long as you return a value of type Number.

Answer 5 · 2021-12-13T15:27:54.000Z

Thanks @folkvir , I'll keep an eye. for now, I will be running a similarity String prior to feeding topK. Very close Strings would be converted to the same String.
My case is looking for top searches on a website by the way.

Many thanks again!