seatgeek/thefuzz

The library doesn't use Levenshtein at all, or at least it's inconsistent

R-N opened this issue · 3 comments

R-N commented

This library is described as fuzzy string matching with Levenshtein distance. However, it doesn't seem to use Levenshtein at all?

fuzz.ratio("tide", "diet") returns:

  • 50 with python-Levenshtein installed
  • 25 without python-Levenshtein

This is because python-Levenshtein uses Indel distance for ratio instead of Levenshtein. Issue 47 in python-Levenshtein mentioned that "Levenshtein.ratio is based upon the Indel distance and not on the Levenshtein distance". (That's the correct repo I think). Indel distance is similar to Levenshtein, but they are not the same and may produce different results.

But it doesn't seem to just end there. difflib, the fallback for when you don't have python-Levenshtein installed, doesn't seem to use Levenshtein distance either. Issue 128 in the older repository/version of this package mentioned that difflib uses "Ratcliff-Obershelp algorithm, not the Levenshtein distance".

To conclude:

  • This library doesn't seem to use Levenshtein distance at all.
  • Installing python-Levenshtein doesn't just simply give speed up. It will produce different results because different algorithms are used.

Note: This library uses ratio function from Levenshtein package or difflib for all of its function with ratio in its name.

Note that the Indel distance is the same as the Levenshtein distance with substitutions weighted as 2. Still I agree this is misleading. I think to a part this stems from python-Levenshtein, where Levenshtein.distance calculates the Levenshtein distance, but Levenshtein.ratio calculates the normalized Indel similarity (In my defense this was implemented before I became the maintainer of python-Levenshtein 😅 )

R-N commented

Note that the Indel distance is the same as the Levenshtein distance with substitutions weighted as 2.

Yeah, I also mentioned they're similar. Still, they're not the same.

I think to a part this stems from python-Levenshtein, where Levenshtein.distance calculates the Levenshtein distance, but Levenshtein.ratio calculates the normalized Indel similarity

Yup. I mentioned that too. I can't help but wonder why. The library is literally named Levenshtein. It could've been an option instead of the default, having "Indel" in the function name like it does for Jaro, or as an optional parameter.

Well, I think the readme should at least be changed to tell people about this. It's where they'd read first and currently it says "It uses Levenshtein Distance" at the very beginning, while it doesn't.

Yes no clue why the original author wrote it like this 🤷‍♂️