[Question] Has bloom filters a higher probability of collide when strings are similar
bacloud23 opened this issue · 5 comments
Because if it is the case, this really suits my need.
I'm trying to discover popular searches on a website. For this, I'm using TopK algorithm which is based on Bloom Filter hashing.
I don't want "Hello world" and "hello world" to be count twice. So if it is a collision, that would be really appropriate for my use case.
Thanks a lot @Callidon !!!
if not, how can I replace the Hash function used with another one more permissive
Hi 👋
Thank you for your interest in our library!
I don't want "Hello world" and "hello world" to be count twice. So if it is a collision, that would be really appropriate for my use case.
Under the hood, we are using xxHash. Their website contains a full benchmark, including collision probabilities. You should be able to find all information you need there.
if not, how can I replace the Hash function used with another one more permissive
You can change the seed used by the hash functions, but I'm afraid we do not have an API for swapping the hash function. The reason is that all the data-structures implemented by our package are carefully designed around xxHash. However, you can override any classes and their methods to customize how values are hashed, but we don't provide support for this, as it is a very advanced usage of the library.
Hello @Callidon
Thank you so much for your reply.
Ok then I will not mess with the hash function itself. I will pre-run a similarity detection on my stream (which is a stream of strings).
Closing this.
Thanks!
In newer version (soon I hope) you'll have the possibility to override a function for using any custom hash functions as long as you return a value of type Number
.
Thanks @folkvir , I'll keep an eye. for now, I will be running a similarity String prior to feeding topK. Very close Strings would be converted to the same String.
My case is looking for top searches on a website by the way.
Many thanks again!