seatgeek/fuzzywuzzy

token_set_ratio Degenerate Case

rogerrohrbach opened this issue · 0 comments

Referring to the description of token_set_ratio in the original blog post: if every token of STRING1 also appears in STRING2, so that the SORTED_INTERSECTION covers all of STRING1 but is only a strict subset of STRING2, the resulting ratio is always 100. E.g.,

from fuzzywuzzy import fuzz

fuzz.token_set_ratio("Deep Learning", "Python Machine Learning: Machine Learning and Deep Learning with Python, scikit-learn, and TensorFlow 2")

yields 100. This is patently incorrect, and it contradicts the stated intuition ("because the SORTED_INTERSECTION component is always exactly the same, the scores increase when (a) that makes up a larger percentage of the full string, and (b) the string remainders are more similar").

Looking at fuzz._token_set, we see that it returns

max(
    [
        ratio_func(sorted_sect, combined_1to2),
        ratio_func(sorted_sect, combined_2to1),
        ratio_func(combined_1to2, combined_2to1)
    ]
)
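To make the degenerate case concrete, here is a rough standard-library reconstruction of the three strings being compared. It is only a sketch: simple_ratio, tokens and token_set_candidates are hypothetical names, difflib.SequenceMatcher stands in for ratio_func, and the regex tokenisation only approximates the library's preprocessing.

```python
import re
from difflib import SequenceMatcher


def simple_ratio(a, b):
    """Similarity on a 0..100 scale (assumed stand-in for ratio_func)."""
    return int(round(100 * SequenceMatcher(None, a, b).ratio()))


def tokens(s):
    """Lowercase and split on non-alphanumerics (approximate preprocessing)."""
    return set(re.findall(r"[a-z0-9]+", s.lower()))


def token_set_candidates(s1, s2):
    """Return (sorted_sect, combined_1to2, combined_2to1) for two strings."""
    t1, t2 = tokens(s1), tokens(s2)
    sorted_sect = " ".join(sorted(t1 & t2))
    rest_1to2 = " ".join(sorted(t1 - t2))  # tokens only in s1
    rest_2to1 = " ".join(sorted(t2 - t1))  # tokens only in s2
    combined_1to2 = (sorted_sect + " " + rest_1to2).strip()
    combined_2to1 = (sorted_sect + " " + rest_2to1).strip()
    return sorted_sect, combined_1to2, combined_2to1


short = "Deep Learning"
long_title = ("Python Machine Learning: Machine Learning and Deep Learning "
              "with Python, scikit-learn, and TensorFlow 2")
sect, c12, c21 = token_set_candidates(short, long_title)
# The remainder of `short` is empty, so the first candidate
# compares the intersection with itself.
print(repr(sect), repr(c12), simple_ratio(sect, c12))
```

Because every token of the short string occurs in the long one, combined_1to2 collapses to sorted_sect, and the first of the three comparisons is a string compared with itself.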

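It appears the assumption is that the string remainder will never be empty. When it is empty, combined_1to2 (or combined_2to1) is identical to sorted_sect, so the corresponding ratio_func call compares a string with itself and returns 100, which then wins the max. Perhaps something like this is more appropriate: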

max(
    [
        0 if sorted_sect == combined_1to2 else ratio_func(sorted_sect, combined_1to2),
        0 if sorted_sect == combined_2to1 else ratio_func(sorted_sect, combined_2to1),
        ratio_func(combined_1to2, combined_2to1)
    ]
)
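As a self-contained way to try the idea, the following sketch applies the same guard outside the library. It is an approximation under stated assumptions: guarded_token_set_ratio and simple_ratio are hypothetical names, difflib.SequenceMatcher stands in for ratio_func, and the regex tokenisation is only a rough substitute for the library's preprocessing.

```python
import re
from difflib import SequenceMatcher


def simple_ratio(a, b):
    """Similarity on a 0..100 scale (assumed stand-in for ratio_func)."""
    return int(round(100 * SequenceMatcher(None, a, b).ratio()))


def guarded_token_set_ratio(s1, s2):
    """Token-set score that ignores intersection-vs-itself comparisons."""
    t1 = set(re.findall(r"[a-z0-9]+", s1.lower()))
    t2 = set(re.findall(r"[a-z0-9]+", s2.lower()))
    sorted_sect = " ".join(sorted(t1 & t2))
    combined_1to2 = (sorted_sect + " " + " ".join(sorted(t1 - t2))).strip()
    combined_2to1 = (sorted_sect + " " + " ".join(sorted(t2 - t1))).strip()
    return max(
        [
            # Score 0 when a remainder is empty, i.e. the combined string
            # is just the intersection again.
            0 if sorted_sect == combined_1to2 else simple_ratio(sorted_sect, combined_1to2),
            0 if sorted_sect == combined_2to1 else simple_ratio(sorted_sect, combined_2to1),
            simple_ratio(combined_1to2, combined_2to1),
        ]
    )


score = guarded_token_set_ratio(
    "Deep Learning",
    "Python Machine Learning: Machine Learning and Deep Learning "
    "with Python, scikit-learn, and TensorFlow 2",
)
print(score)  # no longer 100 for the example above
```

Strings with identical token sets still score 100 under this guard, since the third comparison is between two equal combined strings.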