MinHash output is not mergeable with hash sizes under 64
Closed this issue · 3 comments
Describe the bug
The output of MinHash.compute
cannot be merged with another output on hash sizes of under 64
To Reproduce
import com.dynatrace.hash4j.hashing.Hasher64;
import com.dynatrace.hash4j.hashing.Hashing;
import com.dynatrace.hash4j.similarity.ElementHashProvider;
import com.dynatrace.hash4j.similarity.SimilarityHasher;
import com.dynatrace.hash4j.similarity.SimilarityHashing;
import com.dynatrace.hash4j.util.PackedArray;
import java.util.Arrays;
import java.util.stream.IntStream;
public class MinHashBugRepro {
public static byte[] merge(int components, int bits, byte[] left, byte[] right) {
PackedArray.PackedArrayHandler pah = PackedArray.getHandler(bits);
byte[] output = pah.create(components);
IntStream.range(0, components).forEach(idx ->
pah.set(output, idx, Math.min(pah.get(left, idx), pah.get(right, idx)))
);
return output;
}
public static void main(String[] args) {
SimilarityHasher simHasher32 = SimilarityHashing.minHash(10, 32).createHasher();
SimilarityHasher simHasher64 = SimilarityHashing.minHash(10, 64).createHasher();
Hasher64 hasher = Hashing.wyhashFinal4();
Long hello = hasher.hashCharsToLong("hello");
Long world = hasher.hashCharsToLong("world");
// when using a 32-bit hash, this merge function doesn't work
byte[] one32 = simHasher32.compute(ElementHashProvider.ofValues(hello));
byte[] two32 = simHasher32.compute(ElementHashProvider.ofValues(world));
byte[] three32 = simHasher32.compute(ElementHashProvider.ofValues(hello, world));
byte[] merged32 = merge(10, 32, one32, two32);
System.out.println(Arrays.equals(merged32, three32));
// when using a 64-bit hash this merge function does work
byte[] one64 = simHasher64.compute(ElementHashProvider.ofValues(hello));
byte[] two64 = simHasher64.compute(ElementHashProvider.ofValues(world));
byte[] three64 = simHasher64.compute(ElementHashProvider.ofValues(hello, world));
byte[] merged64 = merge(10, 64, one64, two64);
System.out.println(Arrays.equals(merged64, three64));
}
}
Expected behavior
I expect the data stored in the byte[]
s to be mergeable, even if the library does not include a merge function
Additional context
I've tried writing different merge functions where I decode it signed, unsigned, swap the byte order, and have been unable to find a version that would work with the 32-bit output.
Possible fix
I am reasonably sure that this is because of this section:
if (hash < work[i]) {
work[i] = hash;
}
and that it would work if it used an unsigned lessthan, rather than the signed one, or if it masked before comparing
Even more context
I ran across this when testing semilattice + distributive laws in https://github.com/andimiller/schrodinger
SuperMinHash
merges fine with this merge
method on all bit sizes, it does not work on SimHash
or FastSimHash
Merging is not supported and should also not work for SuperMinHash signatures.
correction, it doesn't work on SuperMinHash, my tests weren't thorough enough
closing this to leave #170 since it's not currently supported