seatgeek/fuzzywuzzy

Feature to robustly handle token orderings


Hi,

I use fuzzywuzzy to match full names extracted from documents to names in a database. Ignoring token order is important for this matching goal. Typically I use fuzz.token_sort_ratio, which gives:

fuzz.token_sort_ratio("fuzzy wuzzy", "wuzzy fuzzy")
> 100

As the name suggests, this function sorts the individual tokens. However, in several instances this gave undesirable results, because a mistake in the first letter of a token can change the sort order, e.g.

fuzz.token_sort_ratio("willy` wonka", "willy zonka")
> 91
fuzz.token_sort_ratio("willy` wonka", "willy vonka")
> 45
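
To make the cause of the drop visible, here is a minimal sketch of the sorting step alone. The helper `sort_tokens` is a hypothetical name, and it only lowercases and splits on whitespace, which is a simplification of fuzzywuzzy's actual preprocessing. It uses plain versions of the names above.

```python
def sort_tokens(s):
    """Lowercase, split on whitespace, and sort the tokens alphabetically.

    Hypothetical helper: a simplified stand-in for the sorting step,
    not fuzzywuzzy's own preprocessing.
    """
    return " ".join(sorted(s.lower().split()))


print(sort_tokens("willy wonka"))  # willy wonka
print(sort_tokens("willy zonka"))  # willy zonka
print(sort_tokens("willy vonka"))  # vonka willy
```

With "vonka" the changed first letter moves the token ahead of "willy", so the two sorted strings no longer line up token by token, even though only one character differs.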

To cope with this, I propose a robust token_sim_ratio function that orders the second list of tokens according to their similarity to the tokens in the first list. I have currently implemented a lightweight solution based on n-gram matching that is robust to mistakes in the first letter of tokens.
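
A rough sketch of what such a function could look like, to make the idea concrete. This is not the implementation mentioned above. The helper names (`bigrams`, `bigram_similarity`, `align_tokens`), the greedy alignment, and the Dice coefficient over character bigrams are all assumptions made for illustration. The final score uses the standard library's `difflib.SequenceMatcher` rather than fuzzywuzzy's own ratio, so that the sketch runs without extra dependencies.

```python
from difflib import SequenceMatcher


def bigrams(token):
    """Return the set of character bigrams of a token."""
    return {token[i:i + 2] for i in range(len(token) - 1)}


def bigram_similarity(a, b):
    """Dice coefficient over character bigram sets, in [0, 1]."""
    ga, gb = bigrams(a), bigrams(b)
    if not ga or not gb:
        # Tokens shorter than two characters have no bigrams.
        return 1.0 if a == b else 0.0
    return 2 * len(ga & gb) / (len(ga) + len(gb))


def align_tokens(tokens1, tokens2):
    """Greedily reorder tokens2 to follow the order of tokens1.

    For each token in tokens1, pick the most similar unused token
    from tokens2. Leftover tokens are appended at the end.
    """
    remaining = list(tokens2)
    aligned = []
    for t1 in tokens1:
        if not remaining:
            break
        best = max(remaining, key=lambda t2: bigram_similarity(t1, t2))
        aligned.append(best)
        remaining.remove(best)
    return aligned + remaining


def token_sim_ratio(s1, s2):
    """Hypothetical order-insensitive ratio on a 0-100 scale."""
    tokens1 = s1.lower().split()
    tokens2 = align_tokens(tokens1, s2.lower().split())
    matcher = SequenceMatcher(None, " ".join(tokens1), " ".join(tokens2))
    return int(round(100 * matcher.ratio()))


print(token_sim_ratio("fuzzy wuzzy", "wuzzy fuzzy"))  # 100
print(token_sim_ratio("willy wonka", "vonka willy"))  # 91
```

Because the second token list is aligned by similarity instead of alphabetically, a changed first letter no longer reorders the tokens. The greedy choice is the simplest option, and an optimal assignment would be a possible refinement for names with many similar tokens.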

My question: is there a general appetite for such functionality, and if so, should I proceed with a PR for this feature?