seatgeek/fuzzywuzzy

How to compare every row with every other row in the same column and delete matching rows with ratio > 90

nithinreddyy opened this issue · 0 comments


For example, I have a DataFrame like this:

Pdf                         Content             Page no
July 20, 2017.PDF           Hello               24.0
July 20, 2017.PDF           Hi                  20.0
July 2, 2018.PDF            Hey                 21.0
July 2, 2018.PDF            Helloo              10.0
July 2, 2018.PDF            Hii                 11.0

I want to compare every row with every other row, and whenever two rows match with a ratio above 90, the later row should be removed. The expected output is:

Pdf                         Content             Page no
July 20, 2017.PDF           Hello               24.0
July 20, 2017.PDF           Hi                  20.0
July 2, 2018.PDF            Hey                 21.0

I'm trying the code below, but it only returns the matching ratios and doesn't remove any rows:

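import pandas as pd
from fuzzywuzzy import fuzz

# Build every (row, row) pairing of the Content column with itself
compare = pd.MultiIndex.from_product([data['Content'],
                                      data['Content']]).to_series()

def metrics(tup):
    # Score one pair of strings with two fuzzywuzzy metrics
    return pd.Series([fuzz.ratio(*tup),
                      fuzz.token_sort_ratio(*tup)],
                     ['ratio', 'token'])

compare = compare.apply(metrics)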
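
One possible way to go from scores to actually dropping rows is sketched below. It is only a sketch under some assumptions: the helper name keep_first_of_near_duplicates is made up for illustration, "keep the first occurrence and drop later near-duplicates" is my reading of the expected output, and the difflib fallback only approximates fuzz.ratio so the snippet still runs if fuzzywuzzy is not installed.

```python
from difflib import SequenceMatcher

try:
    from fuzzywuzzy import fuzz
    ratio = fuzz.ratio
except ImportError:
    # Assumption: stand-in for fuzz.ratio so the sketch runs without fuzzywuzzy.
    def ratio(a, b):
        return int(round(100 * SequenceMatcher(None, a, b).ratio()))


def keep_first_of_near_duplicates(values, threshold=90):
    """Return positions of values that do not score above `threshold`
    against any earlier value that was kept (hypothetical helper)."""
    kept = []
    for i, value in enumerate(values):
        # Keep this value only if it is not too similar to anything kept so far.
        if all(ratio(value, values[j]) <= threshold for j in kept):
            kept.append(i)
    return kept


# Content column from the example in the question.
contents = ["Hello", "Hi", "Hey", "Helloo", "Hii"]
kept = keep_first_of_near_duplicates(contents, threshold=90)
print([contents[i] for i in kept])

# With the DataFrame from the question, the same positions could be applied as:
#   data = data.iloc[keep_first_of_near_duplicates(data['Content'].tolist())]
```

Note that with plain fuzz.ratio, "Hello" vs "Helloo" scores 91 but "Hi" vs "Hii" only scores 80, so at a threshold of 90 the "Hii" row would survive. To get exactly the expected output above, the threshold would have to be below 80, or a different scorer would be needed.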
