seatgeek/thefuzz

Low score when the length difference between two strings is large

Opened this issue · 3 comments

Zaky7 commented

Hi,

I am using thefuzz to fuzzy-match a set of strings, but I don't understand why it gives a low score to "Meta Platforms Inc Class a Common stock" for the query "Meta".

from thefuzz import fuzz
from thefuzz import process


choices = ["Meta Platforms Inc Class a Common stock",
           "Meta Financial Group, Inc. Common Stock",
           "Metals Acquisition Corp",
           "Metacrine, Inc. Common Stock",
           "Metalla Royalty & Streaming Ltd.",
           "Meta Materials Inc. Common Stock",
           "Metals Acquisition Corp Units, each consisting of one Class A ordinary share and one-third of one re"
           ]

res = process.extract("Meta", choices, limit=50)
print(res)

Output

[('Metals Acquisition Corp', 90), ('Metacrine, Inc. Common Stock', 90), ('Metalla Royalty & Streaming Ltd.', 90), ('Meta Materials Inc. Common Stock', 90), ('Meta Platforms Inc Class a Common stock', 60), ('Meta Financial Group, Inc. Common Stock', 60), ('Metals Acquisition Corp Units, each consisting of one Class A ordinary share and one-third of one re', 60)]

Zaky7 commented

After checking the code, I realized the algorithm also takes the relative lengths of the two strings into account.

If I add "Meta Platforms" to the choices, it is found with a score of 90.
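The 90 versus 60 split is consistent with a length-ratio rule that the default scorer (`fuzz.WRatio`) appears to apply to partial matches. The sketch below does not call thefuzz. The thresholds (1.5 and 8) and the scale factors (0.9 and 0.6) are assumptions based on a reading of that scorer and may differ between versions, and both helper names are hypothetical. It also compares raw string lengths, whereas the library preprocesses the strings before measuring them.

```python
from typing import Optional

# Assumed constants, based on a reading of the default WRatio scorer.
# They are not stated in this thread and may differ between versions.
MIN_LEN_RATIO_FOR_PARTIAL = 1.5
LONG_LEN_RATIO = 8
NORMAL_PARTIAL_SCALE = 0.9
LONG_PARTIAL_SCALE = 0.6


def assumed_partial_scale(query: str, choice: str) -> Optional[float]:
    """Hypothetical helper: scale applied to a partial match, or None when
    the lengths are close enough that partial matching would be skipped.
    Both strings are assumed to be non-empty."""
    shorter, longer = sorted((len(query), len(choice)))
    len_ratio = longer / shorter
    if len_ratio < MIN_LEN_RATIO_FOR_PARTIAL:
        return None
    if len_ratio > LONG_LEN_RATIO:
        return LONG_PARTIAL_SCALE
    return NORMAL_PARTIAL_SCALE


def capped_score(query: str, choice: str) -> Optional[int]:
    """Hypothetical helper: score that a perfect substring match
    (partial ratio of 100) would get from the scaled partial term."""
    scale = assumed_partial_scale(query, choice)
    return None if scale is None else round(100 * scale)


for choice in (
    "Meta Platforms",
    "Metals Acquisition Corp",
    "Meta Platforms Inc Class a Common stock",
):
    # Print the choice length next to the capped score for the query "Meta".
    print(len(choice), capped_score("Meta", choice), choice)
```

Under these assumptions, any choice more than eight times as long as the four-character query (that is, longer than 32 characters) is capped at 60, while shorter choices that contain "Meta" reach 90.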

Zaky7 commented

@bigtoast any comments on it?

We also run into the same issue at times, with the longer strings being penalised in what would otherwise be a tie.
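One possible workaround, sketched here rather than taken from the library, is to re-rank the `(choice, score)` pairs that `process.extract` returns so that choices containing the query as a whole word come first. The function name `rerank_whole_word_first` is hypothetical, and the sample pairs are copied from the output posted above.

```python
import re


def rerank_whole_word_first(query, results):
    """Hypothetical helper: put choices that contain `query` as a whole
    word ahead of the rest, then order by descending score.
    `results` is a list of (choice, score) pairs."""
    pattern = re.compile(r"\b" + re.escape(query) + r"\b", re.IGNORECASE)
    # sorted() is stable, so pairs with equal keys keep their input order.
    return sorted(
        results,
        key=lambda pair: (pattern.search(pair[0]) is None, -pair[1]),
    )


results = [
    ("Metals Acquisition Corp", 90),
    ("Metacrine, Inc. Common Stock", 90),
    ("Meta Materials Inc. Common Stock", 90),
    ("Meta Platforms Inc Class a Common stock", 60),
]

reranked = rerank_whole_word_first("Meta", results)
for choice, score in reranked:
    print(score, choice)
```

This leaves the scores themselves untouched and only changes the ordering, so a long name such as "Meta Platforms Inc Class a Common stock" is listed ahead of "Metals Acquisition Corp" even though its score is lower.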