seatgeek/fuzzywuzzy

What is the max possible value (upper bound) for fuzz.ratio?

sillybun opened this issue · 4 comments

It would be helpful to know the maximum possible value (upper bound) of:

$fuzz.ratio(sx, sy)$

where the length of $sx$ is $x$ and length of $sy$ is $y$ (x <= y).

It seems that $100 * \sqrt{x / y}$ is a rough approximation.

fuzz.ratio is a normalized version of the InDel distance (similar to Levenshtein, but without substitutions), scaled to the range 0-100:

round(100 * (1 - InDelDist / (len1 + len2)))

so the upper bound is 100
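
For illustration, the normalization above can be sketched in plain Python. This is a minimal sketch and not the library's actual implementation: the helper names `lcs_length`, `indel_distance` and `indel_ratio` are made up for this example, the InDel distance is derived from the longest common subsequence, and returning 100 for two empty strings is an assumption.

```python
def lcs_length(s1: str, s2: str) -> int:
    """Length of the longest common subsequence, using O(len(s2)) memory."""
    prev = [0] * (len(s2) + 1)
    for ch1 in s1:
        curr = [0]
        for j, ch2 in enumerate(s2, start=1):
            if ch1 == ch2:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def indel_distance(s1: str, s2: str) -> int:
    """Number of insertions and deletions needed to turn s1 into s2."""
    # Every character outside the common subsequence is deleted or inserted.
    return len(s1) + len(s2) - 2 * lcs_length(s1, s2)


def indel_ratio(s1: str, s2: str) -> int:
    """round(100 * (1 - InDelDist / (len1 + len2))), as in the formula above."""
    total = len(s1) + len(s2)
    if total == 0:
        return 100  # assumption: two empty strings count as identical
    return round(100 * (1 - indel_distance(s1, s2) / total))
```

With "kitten" and "sitting" the common subsequence is "ittn", so the distance is 6 + 7 - 2 * 4 = 5 and the ratio is round(100 * (1 - 5 / 13)) = 62.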

Rereading your question, I think you might mean a length-based score that is an upper bound for the similarity. For both the Levenshtein and the InDel distance, the distance between two strings is at least the length difference, so in your example with len1 <= len2 the upper bound can be calculated as:

100 * (1 - (len2 - len1) / (len1 + len2))
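
As a small sketch of that bound, the function below takes only the two lengths. The name `ratio_upper_bound` is hypothetical, and the handling of two empty strings is an assumption. `abs()` is used so the argument order does not matter.

```python
def ratio_upper_bound(len1: int, len2: int) -> float:
    """Length-based upper bound: 100 * (1 - |len2 - len1| / (len1 + len2))."""
    total = len1 + len2
    if total == 0:
        return 100.0  # assumption: two empty strings count as identical
    return 100 * (1 - abs(len2 - len1) / total)
```

Algebraically this equals 200 * len1 / (len1 + len2) when len1 <= len2. For lengths 3 and 9 it gives 50, while the $100 * \sqrt{x / y}$ estimate from the question gives about 57.7.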

Thanks very much! It helps a lot!

@sillybun btw I use this as an early-exit condition in RapidFuzz when a score_cutoff argument is provided to the function ;)
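
A rough sketch of how such an early exit might look is shown below. It is not RapidFuzz's code: `ratio_with_cutoff` and its `scorer` parameter are hypothetical names for this example, and returning 0 when the cutoff cannot be reached is an assumption about the desired behaviour.

```python
from typing import Callable


def ratio_with_cutoff(
    s1: str,
    s2: str,
    score_cutoff: float,
    scorer: Callable[[str, str], float],
) -> float:
    """Run `scorer` only if the length-based bound can reach `score_cutoff`."""
    total = len(s1) + len(s2)
    # Length-based upper bound on the similarity; cheap, needs only the lengths.
    bound = 100.0 if total == 0 else 100 * (1 - abs(len(s1) - len(s2)) / total)
    if bound < score_cutoff:
        return 0.0  # early exit: the expensive comparison is never performed
    score = scorer(s1, s2)
    # Assumption: scores below the cutoff are reported as 0.
    return score if score >= score_cutoff else 0.0
```

For "ab" and "abcdefgh" the bound is 100 * (1 - 6 / 10) = 40, so with a cutoff of 80 the scorer is never called.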