Lyonk71/pandas-dedupe

Spellcheck

Closed this issue · 3 comments

Feature request: It would be helpful if a spellcheck were applied before dedupe. I'm seeing a few cases where the misspelled word becomes the canonical value.

@vinnyp great idea for a new feature!

One idea for the spell check: during the training phase, take the most similar match that is repeated most often, show it to the user, and let them override the canonical spelling.

For example:

brand, category
"ABC, LLC",...
"ABC, LLC",...
"ABc, LLC",...
"abc, LLC",...
"ABC, LLC Alphabet",...
"ABC, LLC Alphabet",...

The chosen spelling would be ABC, LLC because it occurs most often (lowercasing before the spellcheck), and the user could override it during the training phase if they wish. ABC, LLC occurs twice as an exact match and four times when case is ignored, whereas the exact match ABC, LLC Alphabet occurs only twice.

You could probably use a copy of the dataframe itself to do that efficiently before starting dedupe.
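
A minimal sketch of that counting idea, using only the Python standard library on a plain list of values rather than a dataframe column. The helper name `pick_canonical` is hypothetical and is not part of pandas-dedupe; it groups values case-insensitively, picks the largest group, and suggests the most frequent exact spelling within it. The sample values stand in for the brand column above.

```python
from collections import Counter, defaultdict


def pick_canonical(values):
    """Suggest a canonical spelling (hypothetical helper, not a pandas-dedupe API).

    Values are grouped case-insensitively; the largest group wins, and the
    most frequent exact spelling inside that group is returned.
    """
    groups = defaultdict(Counter)
    for value in values:
        # Key on the lowercased, stripped value; count each exact spelling.
        groups[value.strip().lower()][value] += 1

    # Group with the highest total count when case is ignored.
    best_key = max(groups, key=lambda key: sum(groups[key].values()))

    # Most frequent exact spelling within that group.
    spelling, _count = groups[best_key].most_common(1)[0]
    return spelling


brands = [
    "ABC, LLC",
    "ABC, LLC",
    "ABc, LLC",
    "abc, LLC",
    "ABC, LLC Alphabet",
    "ABC, LLC Alphabet",
]

suggested = pick_canonical(brands)

# The user could still override the suggestion during training.
user_override = None
canonical = user_override or suggested
```

Here the suggestion comes out as ABC, LLC, since that group has four members when case is ignored and ABC, LLC is its most common exact spelling. Note that this only handles case differences, not real misspellings, which would need a fuzzy comparison.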

I wrote this comment even though this issue is closed because it's still in the tasks backlog, and PR #34 mentions that Gazetteer helps with this issue but doesn't mention solving it completely.

@alexis-evelyn Using the canonical=True flag should provide the results you describe above. Feel free to re-post if I misread your comment!