`merge` methods for SimilarityHash variants
andimiller opened this issue · 1 comment
andimiller commented
Problem
After adding UltraLogLog support to Apache Pinot, I've been looking at adding some of the MinHash variants, but to do this I need a reliable way to merge them together when running SQL queries or merging rows.
Solution
I'd like the `SimilarityHasher` interface to also have a `merge` method that takes two `byte[]` and returns a `byte[]` that represents the merged state.
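To make the request concrete, here is a minimal sketch of how an aggregation could fold per-row states with such a method. `SignatureMerger`, `MergeFoldSketch`, and `mergeAll` are hypothetical names used only for illustration; none of them exist in the library, and the real method might well live on `SimilarityHasher` itself.

```java
import java.util.List;

/** Hypothetical sketch; none of these types exist in the library. */
final class MergeFoldSketch {

    /** The requested operation, modelled as a standalone functional interface. */
    interface SignatureMerger {
        byte[] merge(byte[] left, byte[] right);
    }

    /**
     * Folds per-row states into a single state, the way a SQL aggregation
     * would combine partial results. Returns null for an empty input.
     */
    static byte[] mergeAll(SignatureMerger merger, List<byte[]> states) {
        byte[] acc = null;
        for (byte[] state : states) {
            acc = (acc == null) ? state : merger.merge(acc, state);
        }
        return acc;
    }
}
```

For the fold to give the same answer regardless of how rows are grouped, `merge` would need to be associative and commutative.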
Alternatives
- I've tried implementing the merge functions myself and ran into problems like #169
- I did consider a halfway solution of just streaming hashes into it, but that's also not available in the current interface
oertl commented
Merging is currently not supported. In general, the finalization steps in all similarity hashing algorithms can truncate information so that the resulting signatures cannot be further merged but require less memory. A possible solution would be to introduce an intermediate representation that can be merged.
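
As an illustration of the idea, here is a minimal sketch of a mergeable intermediate representation for textbook MinHash. It is not the library's implementation: `MergeableMinHashSketch` and all of its methods are hypothetical, the SplitMix64 mixer is only a stand-in for a family of hash functions, and keeping 8 bits per component is an arbitrary choice to show the truncation.

```java
import java.util.Arrays;

/** Hypothetical sketch of textbook MinHash; not part of the library. */
final class MergeableMinHashSketch {

    /** Intermediate state: one full 64-bit minimum per component. */
    static long[] newState(int numComponents) {
        long[] state = new long[numComponents];
        Arrays.fill(state, Long.MAX_VALUE); // neutral element for min
        return state;
    }

    /** SplitMix64 finalizer, used here as a stand-in hash family. */
    static long mix(long z) {
        z = (z ^ (z >>> 30)) * 0xBF58476D1CE4E5B9L;
        z = (z ^ (z >>> 27)) * 0x94D049BB133111EBL;
        return z ^ (z >>> 31);
    }

    /** Updates every component with the minimum over a per-component hash. */
    static void add(long[] state, long elementHash) {
        for (int i = 0; i < state.length; i++) {
            long h = mix(elementHash + (i + 1) * 0x9E3779B97F4A7C15L);
            state[i] = Math.min(state[i], h);
        }
    }

    /** Merging intermediate states is a component-wise minimum. */
    static long[] merge(long[] a, long[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("component counts differ");
        }
        long[] out = new long[a.length];
        for (int i = 0; i < a.length; i++) {
            out[i] = Math.min(a[i], b[i]);
        }
        return out;
    }

    /**
     * Finalization keeps only the low 8 bits of each minimum. The result is
     * smaller, but the ordering of the original values is lost, so two
     * finalized signatures can no longer be merged.
     */
    static byte[] finalizeSignature(long[] state) {
        byte[] signature = new byte[state.length];
        for (int i = 0; i < state.length; i++) {
            signature[i] = (byte) state[i];
        }
        return signature;
    }
}
```

In this sketch, merging the states of two sets and then finalizing yields the same signature as hashing their union directly, whereas the truncated bytes alone do not carry enough information to compute it.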