Low score when the length difference between two strings is large

Question

Opened this issue 3 years ago · 3 comments

Hi,

I am using thefuzz to fuzzy-match a set of strings, but I don't understand why it gives a low score to "Meta Platforms Inc Class a Common stock" for the query "Meta".

from thefuzz import fuzz
from thefuzz import process


choices = ["Meta Platforms Inc Class a Common stock",
           "Meta Financial Group, Inc. Common Stock",
           "Metals Acquisition Corp",
           "Metacrine, Inc. Common Stock",
           "Metalla Royalty & Streaming Ltd.",
           "Meta Materials Inc. Common Stock",
           "Metals Acquisition Corp Units, each consisting of one Class A ordinary share and one-third of one re"
           ]

res = process.extract("Meta", choices, limit=50)
print(res)

Output

[('Metals Acquisition Corp', 90), ('Metacrine, Inc. Common Stock', 90), ('Metalla Royalty & Streaming Ltd.', 90), ('Meta Materials Inc. Common Stock', 90), ('Meta Platforms Inc Class a Common stock', 60), ('Meta Financial Group, Inc. Common Stock', 60), ('Metals Acquisition Corp Units, each consisting of one Class A ordinary share and one-third of one re', 60)]

Answer 1 · 2021-12-10T11:16:07.000Z

After checking the code, I realized that the scoring also takes the relative lengths of the two strings into account.

If I add "Meta Platforms" to the choices, it is matched with a score of 90.
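
To illustrate the length weighting, here is a minimal sketch of how the default scorer used by process.extract (WRatio) appears to scale partial matches. It is based on my reading of the source, not on the library's documented contract. The helper name simplified_partial_scale is hypothetical, the preprocessing is only approximated, and the thresholds (1.5 and 8) and factors (0.9 and 0.6) are assumptions that may differ between versions.

```python
import re


def simplified_partial_scale(query, choice):
    """Hypothetical helper: models only the length-based scaling step."""
    # Approximate the default preprocessing: lowercase, turn
    # non-alphanumeric characters into spaces, strip the ends.
    q = re.sub(r"[^a-z0-9]", " ", query.lower()).strip()
    c = re.sub(r"[^a-z0-9]", " ", choice.lower()).strip()
    if not q or not c:
        return None
    len_ratio = max(len(q), len(c)) / min(len(q), len(c))
    if len_ratio < 1.5:
        return None  # similar lengths: partial matching is skipped
    # Assumed thresholds, taken from a reading of the WRatio source.
    return 0.6 if len_ratio > 8 else 0.9


choices = [
    "Meta Platforms Inc Class a Common stock",
    "Meta Financial Group, Inc. Common Stock",
    "Metals Acquisition Corp",
    "Meta Materials Inc. Common Stock",
]

for choice in choices:
    scale = simplified_partial_scale("Meta", choice)
    # "meta" is a substring of every choice here, so the partial score
    # is 100 and the scaled ceiling is simply 100 * scale.
    print(round(100 * scale), choice)
```

Under these assumptions, any choice more than 8 times longer than the 4-character query is capped at 60, while shorter choices are capped at 90. That is consistent with the output in the question and with "Meta Platforms" scoring 90.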

Answer 2 · 2021-12-16T10:30:04.000Z

@bigtoast any comments on this?

Answer 3 · 2022-07-21T09:54:54.000Z

We also run into the same issue at times: longer strings are penalised when the scores would otherwise tie.