rapidfuzz/RapidFuzz

Modification of Hamming Distance

Spill-Tea opened this issue · 4 comments

By definition, Hamming distance should only evaluate substitutions. The Latest Modification to Hamming Distance function fundamentally changes the inherent definition, which normally assumes strings must be the same length, by now accepting insertions/deletions into a new not quite pseudo-Levenshtein distance metric.

Could these latest modifications instead define a new and separate String distance metric?

The modification in question can be found at this commit

There are generally two way to handle strings of different length with the Hamming distance:

  • extend the short sequence with dummy characters (maybe talking about insertions / deletions was not the best here)
  • throw an exception

I thought that most of the time it is actually preferable to extend the shorter sequence with dummy characters, so it still delivers reasonable results. The argument for the exception would be that this could be an error in the user code, since the hamming distance is used with strings of different length.

Is there any other reason why you think the exception is preferable?

I could add an extra argument to throw an exception / handle strings of different length instead.

I think thats a fair compromise - adding a boolean flag, like pad=true, which pads strings of unequal length if true, else raises value error. At first, I thought that might be better suited for the role of the string processor callable argument, but those function before comparison between input strings.

An example snippet

min_len, dist = sorted((len(s1), len(s2)))
if not pad and min_len != dist:
    raise ValueError("Strings Must be the Same Length")

This is now implemented in the main branch:

>>> from rapidfuzz.distance import Hamming
>>> Hamming.distance("aa", "aaa")
1
>>> Hamming.distance("aa", "aaa", pad=False)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "distance/metrics_cpp.pyx", line 661, in rapidfuzz.distance.metrics_cpp_avx2.hamming_distance
ValueError: Sequences are not the same length.