dynatrace-oss/hash4j

MinHash output is not mergeable with hash sizes under 64

Closed this issue · 3 comments

Describe the bug
The output of MinHash.compute cannot be merged with another output on hash sizes of under 64

To Reproduce

import com.dynatrace.hash4j.hashing.Hasher64;
import com.dynatrace.hash4j.hashing.Hashing;
import com.dynatrace.hash4j.similarity.ElementHashProvider;
import com.dynatrace.hash4j.similarity.SimilarityHasher;
import com.dynatrace.hash4j.similarity.SimilarityHashing;
import com.dynatrace.hash4j.util.PackedArray;

import java.util.Arrays;
import java.util.stream.IntStream;

public class MinHashBugRepro {

    public static byte[] merge(int components, int bits, byte[] left, byte[] right) {
        PackedArray.PackedArrayHandler pah = PackedArray.getHandler(bits);
        byte[] output = pah.create(components);
        IntStream.range(0, components).forEach(idx ->
            pah.set(output, idx, Math.min(pah.get(left, idx), pah.get(right, idx)))
        );
        return output;
    }

    public static void main(String[] args) {
        SimilarityHasher simHasher32 = SimilarityHashing.minHash(10, 32).createHasher();
        SimilarityHasher simHasher64 = SimilarityHashing.minHash(10, 64).createHasher();
        Hasher64 hasher = Hashing.wyhashFinal4();

        Long hello = hasher.hashCharsToLong("hello");
        Long world = hasher.hashCharsToLong("world");

        // when using a 32-bit hash, this merge function doesn't work
        byte[] one32 = simHasher32.compute(ElementHashProvider.ofValues(hello));
        byte[] two32 = simHasher32.compute(ElementHashProvider.ofValues(world));
        byte[] three32 = simHasher32.compute(ElementHashProvider.ofValues(hello, world));
        byte[] merged32 = merge(10, 32, one32, two32);
        System.out.println(Arrays.equals(merged32, three32));

        // when using a 64-bit hash this merge function does work
        byte[] one64 = simHasher64.compute(ElementHashProvider.ofValues(hello));
        byte[] two64 = simHasher64.compute(ElementHashProvider.ofValues(world));
        byte[] three64 = simHasher64.compute(ElementHashProvider.ofValues(hello, world));
        byte[] merged64 = merge(10, 64, one64, two64);
        System.out.println(Arrays.equals(merged64, three64));
    }
}

Expected behavior
I expect the data stored in the byte[]s to be mergeable, even if the library does not include a merge function

Additional context
I've tried writing different merge functions where I decode it signed, unsigned, swap the byte order, and have been unable to find a version that would work with the 32-bit output.

Possible fix
I am reasonably sure that this is because of this section:

          if (hash < work[i]) {
            work[i] = hash;
          }

and that it would work if it used an unsigned lessthan, rather than the signed one, or if it masked before comparing

Even more context
I ran across this when testing semilattice + distributive laws in https://github.com/andimiller/schrodinger
SuperMinHash merges fine with this merge method on all bit sizes, it does not work on SimHash or FastSimHash

oertl commented

Merging is not supported and should also not work for SuperMinHash signatures.

correction, it doesn't work on SuperMinHash, my tests weren't thorough enough

closing this to leave #170 since it's not currently supported