rapidfuzz/RapidFuzz

Issue with partial_ratio_alignment

laphang opened this issue · 3 comments

In the example below, partial_ratio_alignment seems to cut the match in the second string short. I expected it to include the trailing "et."

Code:
from rapidfuzz import fuzz

query_string = "Business's say they got nothing out of last night's budget."
contains_string = "Business's say they've got nothing out of last night's budget. It's really hard out there!"

match = fuzz.partial_ratio_alignment(query_string, contains_string, score_cutoff=90)

print(match)
print(query_string[match.src_start:match.src_end], contains_string[match.dest_start:match.dest_end])

Output:
ScoreAlignment(score=94.91525423728814, src_start=0, src_end=59, dest_start=0, dest_end=59)
("Business's say they got nothing out of last night's budget.",  "Business's say they've got nothing out of last night's budg")

partial_ratio uses a sliding-window approach to find the optimal alignment of the shorter string within the longer string. It therefore cannot find an alignment in which the matched subsequence of the longer string is longer than the shorter string. The subsequence is either exactly as long as the shorter string or, if it starts or ends at the start or end of the longer string, shorter.
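To make the window constraint concrete, here is a minimal sketch of the sliding-window idea. It is not rapidfuzz's implementation: the function name sliding_window_alignment is hypothetical, and difflib.SequenceMatcher from the standard library stands in for the similarity score, so the numbers will differ from what fuzz.partial_ratio_alignment reports. The only point is that no candidate window is ever longer than the shorter string.

```python
from difflib import SequenceMatcher


def sliding_window_alignment(short, long):
    """Return (score, start, end) of the best-scoring window of `long`.

    Hypothetical helper for illustration only. Every window is at most
    len(short) characters wide; windows clipped at either edge of `long`
    are narrower.
    """
    m = len(short)
    best = (-1.0, 0, 0)
    # Negative starts model windows that hang over the left edge.
    for start in range(1 - m, len(long)):
        lo, hi = max(start, 0), min(start + m, len(long))
        # Stand-in similarity on a 0-100 scale (not rapidfuzz's scorer).
        score = 100 * SequenceMatcher(None, short, long[lo:hi], autojunk=False).ratio()
        if score > best[0]:
            best = (score, lo, hi)
    return best


print(sliding_window_alignment("abc", "xxabcxx"))  # (100.0, 2, 5)
```

With the strings from this issue, the query is 59 characters long while the desired match in the longer string ("Business's say they've ... budget.") is 62 characters, so no window of this kind can cover it.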

The metric you are looking for is Smith-Waterman, which is not yet implemented in rapidfuzz: #175
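For reference, a textbook Smith-Waterman local alignment can be sketched in a few lines of pure Python. This is only an illustration, not a rapidfuzz API: the function name smith_waterman and the scoring values (match=2, mismatch=-1, linear gap=-1) are assumptions chosen for the example. Unlike a fixed-size window, the aligned span in either string can grow to absorb insertions.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    """Return (score, (a_start, a_end), (b_start, b_end)) of the best local alignment.

    Hypothetical helper with assumed scoring values; uses a linear gap penalty.
    """
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    # Fill the scoring matrix; cells never drop below zero (local alignment).
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)
    # Trace back from the best cell until the score reaches zero.
    i, j = best_pos
    end_a, end_b = i, j
    while i > 0 and j > 0 and H[i][j] > 0:
        diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
        if H[i][j] == diag:
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
    return best, (i, end_a), (j, end_b)


query_string = "Business's say they got nothing out of last night's budget."
contains_string = "Business's say they've got nothing out of last night's budget. It's really hard out there!"

score, (qs, qe), (cs, ce) = smith_waterman(query_string, contains_string)
print(query_string[qs:qe])
print(contains_string[cs:ce])
```

With these assumed scores, the aligned span in the longer string is three characters longer than the query because the "'ve" insertion is absorbed as a gap.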

Thanks for the fast response, and also for pointing out the parasail package in the issue you linked. It looks interesting.

FWIW, I had pretty good results with parasail. Here's an example:

import parasail

query_string = "Business's say they got nothing out of last night's budget."
contains_string = "Business's say they've got nothing out of last night's budget. It's really hard out there!"

# Arguments: gap-open penalty 10, gap-extend penalty 1, substitution matrix blosum50
result = parasail.ssw(query_string, contains_string, 10, 1, parasail.blosum50)

print(query_string[result.read_begin1:result.read_end1+1])
print(contains_string[result.ref_begin1:result.ref_end1+1])

Output:
Business's say they got nothing out of last night's budget.
Business's say they've got nothing out of last night's budget.

I later used rapidfuzz again for distance / score calculations.