seatgeek/thefuzz

[Question] get the real similarity ratio instead of integer.

a-sajjad72 opened this issue · 6 comments

I want to stay on the thefuzz module because, as you can see from the output below, thefuzz gives better results than rapidfuzz. Another thing is that thefuzz also provides UQRatio to allow non-ASCII (Unicode) characters. Is it possible to get a float instead of an int?

Below is the code using the rapidfuzz Python module, which gives the results as real (float) numbers.

from rapidfuzz import fuzz as rf_fuzz
from rapidfuzz import process as rf_process
# books.title (Series) is a column of dataframe containing books titles
# s (string) is the book to search
s = "alice in the wonder land"
rf_process.extract(s,books.title, scorer=rf_fuzz.QRatio)

OUTPUT

[('New Alice in the Old Wonderland', 76.36363636363637, 9619),
 ('Palace in the Garden', 68.18181818181819, 5161),
 ('House on the Borderland', 68.08510638297872, 2486),
 ('Alice in Wonderland (Drama)', 66.66666666666667, 7303),
 ("Alice's Adventures in Wonderland", 64.28571428571428, 110)]

Below is the code using the thefuzz Python module, which gives the results as integers (int).

from thefuzz import fuzz as tf_fuzz
from thefuzz import process as tf_process
# books.title (Series) is a column of dataframe containing books titles
# s (string) is the book to search
s = "alice in the wonder land"
tf_process.extract(s,books.title, scorer=tf_fuzz.QRatio)

OUTPUT

[('New Alice in the Old Wonderland', 84, 9619),
 ('Alice in Wonderland (Drama)', 76, 7303),
 ("Alice's Adventures in Wonderland", 71, 110),
 ("Alice's Abenteuer im Wunderland", 69, 3232),
 ('Palace in the Garden', 68, 5161)]

I want to stay on the thefuzz module because, as you can see from the output below, thefuzz gives better results than rapidfuzz.

By default, rapidfuzz no longer preprocesses strings, so that behaviour is consistent across all functions. However, you can easily enable preprocessing to get the same results:

>>> from rapidfuzz import fuzz as rf_fuzz
>>> from rapidfuzz import process as rf_process
>>> from rapidfuzz import utils as rf_utils
>>> s = "alice in the wonder land"
>>> choices = ['New Alice in the Old Wonderland']
>>> rf_process.extract(s, choices, scorer=rf_fuzz.QRatio)
[('New Alice in the Old Wonderland', 76.36363636363637, 0)]
>>> rf_process.extract(s,['New Alice in the Old Wonderland'], processor=rf_utils.default_process, scorer=rf_fuzz.QRatio)
[('New Alice in the Old Wonderland', 83.63636363636364, 0)]
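
The gap between the two scores comes down to letter case. As a rough illustration, assuming QRatio behaves like a normalized Indel similarity (2 × LCS length / combined length, scaled to 100), the pure-Python sketch below reproduces both numbers. The helper names `lcs_length`, `indel_ratio`, and `simple_process` are hypothetical; `simple_process` is only a simplified stand-in for rapidfuzz's `default_process`, not its implementation.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence (row-by-row dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def indel_ratio(a: str, b: str) -> float:
    """Assumed scoring formula: 100 * 2 * LCS / (len(a) + len(b))."""
    total = len(a) + len(b)
    if total == 0:
        return 100.0  # two empty strings are treated as identical in this sketch
    return 100.0 * 2 * lcs_length(a, b) / total


def simple_process(s: str) -> str:
    """Simplified preprocessing: non-alphanumerics to spaces, lowercase, trim."""
    cleaned = "".join(ch if ch.isalnum() else " " for ch in s)
    return cleaned.lower().strip()


query = "alice in the wonder land"
choice = "New Alice in the Old Wonderland"

raw_score = indel_ratio(query, choice)
processed_score = indel_ratio(simple_process(query), simple_process(choice))
rounded_score = int(round(processed_score))  # the rounding thefuzz applies
```

Here `raw_score` comes out at about 76.36 and `processed_score` at about 83.64, which rounds to the 84 that thefuzz reports for the same pair.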

Another thing is that thefuzz also provides UQRatio to allow non-ASCII (Unicode) characters.

force_ascii is a pretty useless feature. It filters out any characters with code points between 128 and 256, so characters from the extended ASCII set are removed while, e.g., Chinese characters are not. Even if it were implemented to filter out all non-ASCII characters, I do not see any reason why you would ever want to do this.
Since I do not think this should ever be done, I removed this behaviour in rapidfuzz. This means rapidfuzz.fuzz.QRatio is the equivalent of thefuzz.fuzz.UQRatio. Because QRatio already supports full Unicode, rapidfuzz does not provide a separate UQRatio; it would be exactly the same function.
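
To make the point about force_ascii concrete, here is a minimal sketch of a filter that behaves the way the comment describes: it drops code points 128 to 255 and leaves everything else alone. The names `EXTENDED_RANGE_TABLE` and `strip_extended_range` are hypothetical and this is not copied from thefuzz; it only illustrates the described behaviour.

```python
# Map every code point in 128..255 to None so str.translate deletes it.
EXTENDED_RANGE_TABLE = {code_point: None for code_point in range(128, 256)}


def strip_extended_range(text: str) -> str:
    """Remove only the characters whose code points fall in 128..255."""
    return text.translate(EXTENDED_RANGE_TABLE)


# "café 中文 naïve", written with escapes so the code points are unambiguous.
sample = "caf\u00e9 \u4e2d\u6587 na\u00efve"
filtered = strip_extended_range(sample)
```

With this filter, the accented Latin letters (U+00E9, U+00EF) disappear while the Chinese characters (U+4E2D, U+6587) survive, so the result is not actually ASCII-only.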

Note that at this point thefuzz uses rapidfuzz under the hood.

thefuzz/thefuzz/fuzz.py

Lines 21 to 32 in 681abb2

def _rapidfuzz_scorer(scorer, s1, s2, force_ascii, full_process):
    """
    wrapper around rapidfuzz function to be compatible with the API of thefuzz
    """
    if full_process:
        if s1 is None or s2 is None:
            return 0

        s1 = utils.full_process(s1, force_ascii=force_ascii)
        s2 = utils.full_process(s2, force_ascii=force_ascii)

    return int(round(scorer(s1, s2)))

It's basically a compatibility layer which adds support for force_ascii, rounds the results to integers, and handles differences in function defaults.

Note that at this point thefuzz uses rapidfuzz under the hood.

Actually, I had already seen this. That's how I came to know that thefuzz uses rapidfuzz.

thefuzz/thefuzz/fuzz.py

Lines 21 to 32 in 681abb2

def _rapidfuzz_scorer(scorer, s1, s2, force_ascii, full_process):
    """
    wrapper around rapidfuzz function to be compatible with the API of thefuzz
    """
    if full_process:
        if s1 is None or s2 is None:
            return 0

        s1 = utils.full_process(s1, force_ascii=force_ascii)
        s2 = utils.full_process(s2, force_ascii=force_ascii)

    return int(round(scorer(s1, s2)))

I also tried removing the int(round()) around scorer(s1, s2), but it didn't work. Did I do something wrong?

It's basically a compatibility layer which adds support for force_ascii, rounds the results to integers, and handles differences in function defaults.

Isn't there any way to get rid of the rounding of results to integers?

Isn't there any way to get rid of the rounding of results to integers?

It's not particularly clear to me why you want to use a compatibility wrapper but would prefer to have the behaviour of the wrapped library 😕
At least right now you can't get the float value in thefuzz, since the score returned by rapidfuzz is rounded in int(round(scorer(s1, s2))). Without editing the source code you can't get rid of this.
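
For readers who want the wrapper pattern without the rounding, the following is a minimal sketch of a float-preserving analogue kept in your own code rather than patched into thefuzz. Everything here is hypothetical: `float_scorer` and `difflib_ratio` are illustrative names, and the standard-library `difflib` scorer is only a stand-in so the example runs without extra dependencies (it is not the algorithm rapidfuzz uses). Any callable that takes two strings and returns a number could be passed as `scorer` instead.

```python
from difflib import SequenceMatcher


def float_scorer(scorer, s1, s2, processor=None):
    """Same shape as the wrapper quoted above, but the score is not rounded."""
    if processor is not None:
        if s1 is None or s2 is None:
            return 0.0
        s1 = processor(s1)
        s2 = processor(s2)
    return float(scorer(s1, s2))  # no int(round(...)) here


def difflib_ratio(s1, s2):
    """Stand-in scorer on a 0..100 scale, built on the standard library."""
    return 100.0 * SequenceMatcher(None, s1, s2).ratio()


def lower_and_strip(s):
    """Stand-in preprocessing step: lowercase and trim whitespace."""
    return s.lower().strip()


raw = float_scorer(difflib_ratio, "Alice", "alice")
processed = float_scorer(difflib_ratio, "Alice", "alice ", processor=lower_and_strip)
```

The design choice is simply to keep the scorer's native float and leave any rounding to the caller.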

I also tried removing the int(round()) around scorer(s1, s2), but it didn't work. Did I do something wrong?

You will have to do the same in process.py.

OK, I'll use rapidfuzz for that.