dynatrace-oss/hash4j

`merge` methods for SimilarityHash variants

andimiller opened this issue · 1 comment

Problem

After adding UltraLogLog support to Apache Pinot, I've been looking at adding some of the MinHash variants. To do this, I need a reliable way to merge them, both when running SQL queries and when merging rows.

Solution

I'd like the `SimilarityHasher` interface to also have a `merge` method that takes two `byte[]` signatures and returns a `byte[]` that represents the merged state.

Alternatives

  • I've tried implementing the merge functions myself, and ran into problems like #169
  • I did consider a halfway solution of just streaming hashes into it, but that's also not available in the current interface

oertl commented

Merging is currently not supported. In general, the finalization steps in all similarity hashing algorithms can truncate information so that the resulting signatures cannot be further merged but require less memory. A possible solution would be to introduce an intermediate representation that can be merged.
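
To illustrate what such a mergeable intermediate representation could look like, here is a minimal sketch for plain MinHash. It is not part of hash4j: the class `MergeableMinHash`, its methods, and the 1-bit-per-component finalization are all hypothetical names and choices made for this example. The intermediate state keeps the full 64-bit minimum per component, so two states can be merged with an element-wise minimum. Only `toSignature` truncates, and its output cannot be merged further.

```java
import java.util.Arrays;

/** Hypothetical illustration only; not the hash4j API. */
final class MergeableMinHash {
    // Intermediate state: one full 64-bit running minimum per component.
    private final long[] minima;

    MergeableMinHash(int numComponents) {
        minima = new long[numComponents];
        Arrays.fill(minima, Long.MAX_VALUE);
    }

    // SplitMix64 finalizer, used here only to derive per-component hash values.
    private static long mix(long z) {
        z = (z ^ (z >>> 30)) * 0xBF58476D1CE4E5B9L;
        z = (z ^ (z >>> 27)) * 0x94D049BB133111EBL;
        return z ^ (z >>> 31);
    }

    /** Adds one element, given as a 64-bit hash of that element. */
    void add(long elementHash) {
        for (int i = 0; i < minima.length; i++) {
            long h = mix(elementHash + (i + 1) * 0x9E3779B97F4A7C15L);
            minima[i] = Math.min(minima[i], h);
        }
    }

    /** Merges two intermediate states; equivalent to hashing the union of both sets. */
    MergeableMinHash merge(MergeableMinHash other) {
        if (other.minima.length != minima.length) {
            throw new IllegalArgumentException("component counts differ");
        }
        MergeableMinHash result = new MergeableMinHash(minima.length);
        for (int i = 0; i < minima.length; i++) {
            result.minima[i] = Math.min(minima[i], other.minima[i]);
        }
        return result;
    }

    /** Finalization: keeps 1 bit per component. The result can no longer be merged. */
    byte[] toSignature() {
        byte[] signature = new byte[(minima.length + 7) / 8];
        for (int i = 0; i < minima.length; i++) {
            signature[i >>> 3] |= (byte) ((minima[i] & 1L) << (i & 7));
        }
        return signature;
    }
}
```

With 64 components, the intermediate state in this sketch needs 512 bytes while the finalized signature needs 8, which is the memory trade-off mentioned above. Once only the low bit of each minimum is kept, there is no way to tell which of two components held the smaller value, so the merge has to happen before finalization.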