token_set_ratio Degenerate Case
rogerrohrbach opened this issue · 0 comments
Referring to the description of token_set_ratio in the original blog post: if every token of STRING1 also appears in STRING2 (so that SORTED_INTERSECTION is all of STRING1 and a strict subset of STRING2), the result ratio will be 100. E.g.,
fuzz.token_set_ratio("Deep Learning", "Python Machine Learning: Machine Learning and Deep Learning with Python, scikit-learn, and TensorFlow 2")
yields 100. This is patently incorrect, and does not uphold the purported intuition ("because the SORTED_INTERSECTION component is always exactly the same, the scores increase when (a) that makes up a larger percentage of the full string, and (b) the string remainders are more similar").
Looking at fuzz._token_set, we see that it returns
max(
    [
        ratio_func(sorted_sect, combined_1to2),
        ratio_func(sorted_sect, combined_2to1),
        ratio_func(combined_1to2, combined_2to1),
    ]
)
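To make the degenerate comparison concrete, here is a minimal sketch that rebuilds the three intermediate strings for the example above. It assumes a simplified preprocessing step (lowercasing and splitting on non-alphanumeric characters), which may differ from what the library actually does, and the helper name `token_set_parts` is hypothetical.

```python
import re


def token_set_parts(s1, s2):
    """Return (sorted_sect, combined_1to2, combined_2to1) for two strings.

    Assumption: preprocessing is simplified to lowercasing and splitting on
    non-alphanumeric characters; the library's own preprocessing may differ.
    """
    t1 = set(re.sub(r"[^a-z0-9]+", " ", s1.lower()).split())
    t2 = set(re.sub(r"[^a-z0-9]+", " ", s2.lower()).split())
    sorted_sect = " ".join(sorted(t1 & t2))
    # Intersection followed by each string's own remainder; strip() removes
    # the dangling space left behind when a remainder is empty.
    combined_1to2 = (sorted_sect + " " + " ".join(sorted(t1 - t2))).strip()
    combined_2to1 = (sorted_sect + " " + " ".join(sorted(t2 - t1))).strip()
    return sorted_sect, combined_1to2, combined_2to1


sect, c12, c21 = token_set_parts(
    "Deep Learning",
    "Python Machine Learning: Machine Learning and Deep Learning "
    "with Python, scikit-learn, and TensorFlow 2",
)
print(repr(sect))  # 'deep learning'
print(repr(c12))   # 'deep learning'
print(repr(c21))
```

Since the remainder of the first string is empty, `sect` and `c12` are the same string, so the first comparison in the list compares a string with itself.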
It appears the assumption is that the string remainder will never be empty. Perhaps something like this is more appropriate:
max(
    [
        0 if sorted_sect == combined_1to2 else ratio_func(sorted_sect, combined_1to2),
        0 if sorted_sect == combined_2to1 else ratio_func(sorted_sect, combined_2to1),
        ratio_func(combined_1to2, combined_2to1),
    ]
)
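For anyone who wants to compare the two behaviours without touching the library, the following is a rough, self-contained sketch. It is not the library's implementation: `simple_ratio` is a stand-in for ratio_func built on the standard library's `difflib.SequenceMatcher`, the preprocessing is simplified, and the names `token_set_sketch` and `skip_empty_remainder` are hypothetical.

```python
import re
from difflib import SequenceMatcher


def simple_ratio(a, b):
    """Stand-in for ratio_func (an assumption): similarity on a 0-100 scale."""
    return int(round(100 * SequenceMatcher(None, a, b).ratio()))


def tokens(s):
    """Simplified preprocessing: lowercase, split on non-alphanumerics."""
    return set(re.sub(r"[^a-z0-9]+", " ", s.lower()).split())


def token_set_sketch(s1, s2, skip_empty_remainder=False):
    """Token-set score; optionally skip comparisons with an empty remainder."""
    t1, t2 = tokens(s1), tokens(s2)
    if not t1 or not t2:
        return 0
    sorted_sect = " ".join(sorted(t1 & t2))
    combined_1to2 = (sorted_sect + " " + " ".join(sorted(t1 - t2))).strip()
    combined_2to1 = (sorted_sect + " " + " ".join(sorted(t2 - t1))).strip()

    def sect_vs(combined):
        # The proposed change: a comparison of the intersection with itself
        # carries no information, so it contributes 0 instead of 100.
        if skip_empty_remainder and sorted_sect == combined:
            return 0
        return simple_ratio(sorted_sect, combined)

    return max(
        sect_vs(combined_1to2),
        sect_vs(combined_2to1),
        simple_ratio(combined_1to2, combined_2to1),
    )


SHORT = "Deep Learning"
LONG = ("Python Machine Learning: Machine Learning and Deep Learning "
        "with Python, scikit-learn, and TensorFlow 2")

print(token_set_sketch(SHORT, LONG))                             # current behaviour
print(token_set_sketch(SHORT, LONG, skip_empty_remainder=True))  # proposed behaviour
```

With the flag off, the sketch reproduces the score of 100 reported here; with it on, the score falls back to how much of the longer string the shared tokens cover. Identical token sets still score 100 through the third comparison, so the change only affects the subset case.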