Partial_Ratio not working
aW3st opened this issue · 5 comments
Having some weird issues using partial ratio. Here's the code:
from fuzzywuzzy import fuzz

test_string = ('completed transactions settlement date trade date '
'symbol name transaction type account type quantity price commissions & fees amount '
'12/23 12/23 dividend '
'appreciation etf dividend - - - $441.99 12/23 12/23 '
'vig dividend appreciation etf reinvestment cash')
'etf' in test_string # returns True
fuzz.partial_ratio('etf', test_string)
Without python-Levenshtein this returns 33; with python-Levenshtein it returns 67. My understanding of the method is that it should be 100, since there's a substring that's a perfect match. Any ideas?
(on python 3.8, btw)
I'm having the same issue. I would also expect a score of 100 from the calls below:
>>> artists_a
'carvar & clock'
>>> artists_b
'carvar clock'
>>> fuzz.partial_ratio(artists_a, artists_b)
83
>>> fuzz.partial_ratio(artists_b, artists_a)
83
I also tried without python-Levenshtein, as suggested in #79, but got the exact same result.
Possibly replace partial_ratio with partial_token_sort_ratio, as mentioned in this Stack Overflow answer. In both our examples it seemed to work as expected.
partial_ratio searches for the best alignment between the two strings and then calculates fuzz.ratio for this alignment. In @aW3st's case the word 'etf' is part of the second string, so you would expect a result of 100. That's not the case in your example, @XDGFX.
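To make the "best alignment" idea concrete, here is a minimal brute-force sketch built only on difflib. It is not fuzzywuzzy's actual implementation (which picks candidate windows from matching blocks rather than trying every offset); the function name sliding_partial_ratio is made up for illustration.

```python
from difflib import SequenceMatcher


def sliding_partial_ratio(s1: str, s2: str) -> int:
    """Score the shorter string against every equal-length window of the longer one.

    Hypothetical helper for illustration only, not part of fuzzywuzzy.
    """
    shorter, longer = (s1, s2) if len(s1) <= len(s2) else (s2, s1)
    if not shorter:
        return 0
    best = 0.0
    for start in range(len(longer) - len(shorter) + 1):
        window = longer[start:start + len(shorter)]
        # autojunk=False disables difflib's "popular element" heuristic
        score = SequenceMatcher(None, shorter, window, autojunk=False).ratio()
        best = max(best, score)
    return round(100 * best)


test_string = ('completed transactions settlement date trade date '
               'symbol name transaction type account type quantity price commissions & fees amount '
               '12/23 12/23 dividend '
               'appreciation etf dividend - - - $441.99 12/23 12/23 '
               'vig dividend appreciation etf reinvestment cash')

# 'etf' occurs verbatim, so one window matches exactly
print(sliding_partial_ratio('etf', test_string))
# neither string contains the other, so no window matches exactly
print(sliding_partial_ratio('carvar & clock', 'carvar clock'))
```

Under this definition an exact substring always scores 100, while two strings that merely overlap stay below 100, which is the distinction between the two examples in this thread.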
When comparing 'carvar & clock' and 'carvar clock', neither is a substring of the other. However, partial_token_sort_ratio works since it re-sorts the words to 'carvar clock &' and 'carvar clock', so afterwards 'carvar clock' is a substring of 'carvar clock &' ;)
@aW3st, you tried both with python-Levenshtein and without, and the two give wrong results for different reasons.
- python-Levenshtein has a known bug in finding the optimal alignment between strings, which is probably the bug you're encountering here as well. You can find this here: #79 (comment)
- When not using python-Levenshtein, fuzzywuzzy falls back to difflib. Here the problem appears to come from difflib's automatic junk heuristic, which is enabled by default. So it would be required to change
Line 46 in 2188520
to
m = SequenceMatcher(None, shorter, longer, False)
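The effect of the junk heuristic can be checked with difflib alone. The sketch below compares the longest matching block found for 'etf' with the heuristic on and off; the helper name longest_block is made up for illustration, and the argument order (shorter first, longer second) follows the call shown above. Per the difflib documentation, the heuristic only applies when the second sequence has at least 200 items, which is the case for this test string.

```python
from difflib import SequenceMatcher

test_string = ('completed transactions settlement date trade date '
               'symbol name transaction type account type quantity price commissions & fees amount '
               '12/23 12/23 dividend '
               'appreciation etf dividend - - - $441.99 12/23 12/23 '
               'vig dividend appreciation etf reinvestment cash')


def longest_block(shorter: str, longer: str, autojunk: bool) -> int:
    """Size of the largest matching block difflib reports (hypothetical helper)."""
    matcher = SequenceMatcher(None, shorter, longer, autojunk=autojunk)
    # get_matching_blocks() always ends with a zero-size sentinel, so max() is safe
    return max(block.size for block in matcher.get_matching_blocks())


# default behaviour: frequent characters such as 'e' and 't' are treated as junk
with_heuristic = longest_block('etf', test_string, autojunk=True)
# heuristic disabled: the full 'etf' substring can be matched
without_heuristic = longest_block('etf', test_string, autojunk=False)

print(len(test_string), with_heuristic, without_heuristic)
```

With the heuristic disabled the whole of 'etf' is found as one block; with it enabled the largest block is shorter, so the window chosen for the partial comparison ends up misaligned.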
As a side note, my library rapidfuzz provides the same string matching algorithm without this problem, so your example string returns a score of 100, as you expected.
Thanks Max, I'll give your library a shot!
@maxbachmann Hi Max, I'm working with @aW3st on a project. We've swapped fuzzywuzzy for your library, and we're seeing great performance. Thanks!