seatgeek/fuzzywuzzy

Feature suggestion: sort tied matches by common letter count, largest to smallest


I am noticing that, with the partial_token_set_ratio scorer, some matches in which one term is a subset of another come back with a non-optimal choice. Ties need a sort order that is independent of the order of the data, perhaps one based on the total number of common tokens (or letters).

"Company" and "Company 1" have a score of 100
"Company 1" and "Company 1" have a score of 100
It would seem that the second pairing would be the better match.

from fuzzywuzzy import fuzz, process

query = 'Company 2'
choices = ['Company', 'Company 1', 'Company 2', 'Awesome Company']
process.extractOne(query, choices, scorer=fuzz.partial_token_set_ratio)

Out[72]: ('Company', 100)
The winner always seems to be the first entry in the list of choices. One could order both lists before calling the functions, but that would introduce a different kind of bias: we would never match the appropriate choice when the matching tokens sit in the middle of the choice string.

The partial_token_sort_ratio scorer shows similar behavior.
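
To make the suggestion concrete, here is a rough sketch of the kind of tie-break I have in mind. It is not part of fuzzywuzzy: best_match, common_tokens, and common_letters are hypothetical helper names, and the (choice, score) pairs are hard-coded to stand in for tied scorer output so the snippet runs with the standard library alone.

```python
from collections import Counter


def common_tokens(a, b):
    """Number of distinct whitespace-separated tokens shared by a and b."""
    return len(set(a.lower().split()) & set(b.lower().split()))


def common_letters(a, b):
    """Number of characters shared by a and b (multiset overlap, spaces ignored)."""
    count_a = Counter(a.lower().replace(" ", ""))
    count_b = Counter(b.lower().replace(" ", ""))
    return sum((count_a & count_b).values())


def best_match(query, scored):
    """Pick the best (choice, score) pair from scored.

    Hypothetical helper: ranks by score first, then by shared tokens,
    then by shared letters, so tied scores no longer fall back to
    whichever choice happens to come first in the input.
    """
    return max(
        scored,
        key=lambda pair: (
            pair[1],
            common_tokens(query, pair[0]),
            common_letters(query, pair[0]),
        ),
    )


query = "Company 2"
# Illustrative pairs, assumed to be tied at 100 as in the example above.
scored = [
    ("Company", 100),
    ("Company 1", 100),
    ("Company 2", 100),
    ("Awesome Company", 100),
]
print(best_match(query, scored))
```

With these inputs the exact match "Company 2" wins regardless of where it sits in the list, because it shares two tokens with the query while every other choice shares one. Choices that are still tied on all three keys would again fall back to input order, so a further key (for example, length difference) might be needed.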