Issue with partial_ratio_alignment

Question

laphang opened this issue a year ago · 3 comments

In the example below, partial_ratio_alignment seems to cut the match in the second string short. I expected it to include the trailing "et."

Code:
from rapidfuzz import fuzz

query_string = "Business's say they got nothing out of last night's budget."
contains_string = "Business's say they've got nothing out of last night's budget. It's really hard out there!"

# Align the shorter query against the longer string
match = fuzz.partial_ratio_alignment(query_string, contains_string, score_cutoff=90)

print(match)
# Show the aligned slices of both strings
print(query_string[match.src_start:match.src_end], contains_string[match.dest_start:match.dest_end])

Output:
ScoreAlignment(score=94.91525423728814, src_start=0, src_end=59, dest_start=0, dest_end=59)
("Business's say they got nothing out of last night's budget.",  "Business's say they've got nothing out of last night's budg")

Answer 1 · 2023-04-27T12:16:14.000Z

partial_ratio uses a sliding window approach to find the optimal alignment of the shorter string within the longer string. It will therefore never find an alignment where the subsequence in the longer string is longer than the shorter string. The subsequence is either as long as the shorter string or, if it starts or ends at the start or end of the longer string, it can be shorter.
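
To make the window-length limit concrete, here is a simplified sketch of a sliding-window search. It is not rapidfuzz's actual implementation: the helper name sliding_window_alignment is hypothetical, and difflib.SequenceMatcher stands in for rapidfuzz's scoring, so the scores will differ. It only illustrates why the matched slice of the longer string can never exceed the length of the shorter one.

```python
from difflib import SequenceMatcher


def sliding_window_alignment(shorter, longer):
    """Return (score, start, end) for the best-scoring window of `longer`.

    Hypothetical helper for illustration only; scoring uses difflib,
    not rapidfuzz.
    """
    m = len(shorter)
    best = (-1.0, 0, 0)
    # Start before index 0 and run past the end, so windows clipped at
    # either edge of `longer` (and therefore shorter than m) are tried too.
    for start in range(-m + 1, len(longer)):
        lo = max(start, 0)
        hi = min(start + m, len(longer))
        window = longer[lo:hi]
        score = 100.0 * SequenceMatcher(None, shorter, window, autojunk=False).ratio()
        if score > best[0]:
            best = (score, lo, hi)
    return best


query_string = "Business's say they got nothing out of last night's budget."
contains_string = "Business's say they've got nothing out of last night's budget. It's really hard out there!"

score, lo, hi = sliding_window_alignment(query_string, contains_string)
# hi - lo is at most len(query_string), so the window cannot cover the
# three extra characters that "'ve" adds to the desired match.
print(hi - lo, len(query_string))
print(contains_string[lo:hi])
```

Every candidate window is capped at the length of the query, which is why the reported alignment stops three characters short of "budget.".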

The metric you are looking for is Smith-Waterman, which is not implemented in rapidfuzz yet: #175
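
For readers unfamiliar with it, the following is a minimal pure-Python sketch of Smith-Waterman local alignment. The function name smith_waterman and the scoring values (match +2, mismatch -1, linear gap -1) are assumptions chosen for illustration. They are not taken from rapidfuzz or from the linked issue, and the quadratic-memory table is only suitable for short strings.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    """Return (score, a_start, a_end, b_start, b_end) of the best local alignment.

    Illustrative sketch with assumed scoring parameters and a linear gap cost.
    """
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    # Fill the table; scores are clamped at 0 so an alignment may start anywhere.
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0, H[i - 1][j - 1] + sub, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the best cell until the score drops to 0.
    i, j = best_pos
    while i > 0 and j > 0 and H[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + sub:
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
    return best, i, best_pos[0], j, best_pos[1]


query_string = "Business's say they got nothing out of last night's budget."
contains_string = "Business's say they've got nothing out of last night's budget. It's really hard out there!"

score, q_start, q_end, c_start, c_end = smith_waterman(query_string, contains_string)
print(query_string[q_start:q_end])
print(contains_string[c_start:c_end])
```

Because gaps are allowed inside the alignment, the matched slice of the longer string can be longer than the query, which the sliding window approach cannot produce.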

Answer 2 · 2023-04-27T23:21:53.000Z

Thanks for the fast response, and also for pointing out the parasail package in the issue you linked. That seems interesting.

Answer 3 · 2023-04-28T07:07:35.000Z

FWIW, I had pretty good results with parasail. Here's an example:

import parasail

query_string = "Business's say they got nothing out of last night's budget."
contains_string = "Business's say they've got nothing out of last night's budget. It's really hard out there!"

# Local alignment with gap open penalty 10, gap extend penalty 1, and the blosum50 matrix
result = parasail.ssw(query_string, contains_string, 10, 1, parasail.blosum50)

# End positions are inclusive, hence the + 1 when slicing
print(query_string[result.read_begin1:result.read_end1 + 1])
print(contains_string[result.ref_begin1:result.ref_end1 + 1])

Output:
Business's say they got nothing out of last night's budget.
Business's say they've got nothing out of last night's budget.

I later used rapidfuzz again for distance / score calculations.