Inconsistent results for token_ratio between 2.15.1 and 3.0.0
alonshalita opened this issue · 3 comments
Hi,
token_ratio returns different results after upgrading from 2.15.1 to 3.0.0 (or any later release). For example:
Python 3.11.2 (main, Mar 24 2023, 00:28:48) [Clang 14.0.0 (clang-1400.0.29.202)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import rapidfuzz
>>> rapidfuzz.__version__
'2.15.1'
>>> rapidfuzz.fuzz.token_ratio(
... "did lincoln. sin the national, banking act of 1863?",
... "Did Lincoln sign the National Banking Act of 1863?")
98.96907216494846
and
Python 3.11.2 (main, Mar 24 2023, 00:28:48) [Clang 14.0.0 (clang-1400.0.29.202)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import rapidfuzz
>>> rapidfuzz.__version__
'3.0.0'
>>> rapidfuzz.fuzz.token_ratio(
... "did lincoln. sin the national, banking act of 1863?",
... "Did Lincoln sign the National Banking Act of 1863?")
87.12871287128714
Is this a bug, or an expected change? I couldn't seem to find anything related in the changelog.
In version 3.0.0, all scorers use processor=None as the default, to make the defaults consistent. Previously, some of them used processor=utils.default_process. This changes the results in your case because the strings are no longer preprocessed. You can re-enable the preprocessing manually:
>>> import rapidfuzz
>>> rapidfuzz.__version__
'3.1.0'
>>> rapidfuzz.fuzz.token_ratio(
... "did lincoln. sin the national, banking act of 1863?",
... "Did Lincoln sign the National Banking Act of 1863?",
... processor=rapidfuzz.utils.default_process)
98.96907216494846
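To see why preprocessing matters for this pair of strings, here is a minimal standard-library sketch. The helper `approx_default_process` is a hypothetical stand-in that only approximates `rapidfuzz.utils.default_process` (replace non-alphanumeric characters with spaces, trim, lowercase); it is not the library's implementation.

```python
import re


def approx_default_process(text: str) -> str:
    """Rough approximation of rapidfuzz.utils.default_process (an assumption)."""
    # Replace every non-word character with a space, then trim and lowercase.
    return re.sub(r"\W", " ", text).strip().lower()


a = "did lincoln. sin the national, banking act of 1863?"
b = "Did Lincoln sign the National Banking Act of 1863?"

# Without preprocessing, case and punctuation make most tokens differ.
raw_diff = [(x, y) for x, y in zip(a.split(), b.split()) if x != y]

# With preprocessing, only the actual typo remains.
clean_diff = [
    (x, y)
    for x, y in zip(approx_default_process(a).split(),
                    approx_default_process(b).split())
    if x != y
]

print(len(raw_diff))  # 6
print(clean_diff)     # [('sin', 'sign')]
```

With processor=None, the scorer compares the raw strings, so differences in case and punctuation lower the score in addition to the real typo.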
In the changelog this is mentioned as:
update defaults of the processor argument to be None everywhere. This changes the defaults of some of
the functions in rapidfuzz.fuzz and rapidfuzz.process.
Thanks for the clarification. Can you tell which functions had their default processor changed?
I updated the changelog to mention this might change the results, how to get back the old behaviour and which functions are affected: https://github.com/maxbachmann/RapidFuzz/releases/tag/v3.0.0.
Affected functions are:
process.extract, process.extract_iter, process.extractOne
fuzz.token_sort_ratio, fuzz.token_set_ratio, fuzz.token_ratio, fuzz.partial_token_sort_ratio, fuzz.partial_token_set_ratio, fuzz.partial_token_ratio, fuzz.WRatio, fuzz.QRatio