Spellcheck
Closed this issue · 3 comments
Feature request: It would be helpful if spellcheck were applied before dedupe. I'm seeing a few cases where the misspelled word becomes the canonical value.
One idea for the spellcheck: during the training phase, take the most similar match that is repeated most often, show it to the user, and let them override the canonical spelling.
For example:
brand, category
"ABC, LLC",...
"ABC, LLC",...
"ABc, LLC",...
"abc, LLC",...
"ABC, LLC Alphabet",...
"ABC, LLC Alphabet",...
The chosen spelling would be "ABC, LLC" because it occurs most often once the values are lowercased before the spellcheck, and the user could override it during the training phase if they wish. "ABC, LLC" occurs twice as an exact match and four times when case is ignored, whereas the exact match "ABC, LLC Alphabet" occurs only twice.
You could probably use a copy of the dataframe itself to do that efficiently before starting dedupe.
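To make the idea concrete, here is a minimal sketch of the counting step, assuming the candidate spellings for one cluster are already available as a list of strings (for example, a column copied out of the dataframe). The function name `suggest_canonical` is hypothetical and is not part of this project's API. It only handles differences in case, not real misspellings.

```python
from collections import Counter


def suggest_canonical(spellings):
    """Suggest a canonical spelling for one group of similar values.

    Hypothetical helper: picks the case-insensitive form that occurs most
    often, then returns its most frequent exact spelling.
    """
    spellings = list(spellings)
    if not spellings:
        raise ValueError("need at least one spelling")

    # Count occurrences while ignoring case.
    folded_counts = Counter(s.lower() for s in spellings)
    top_folded, _ = folded_counts.most_common(1)[0]

    # Among the values that match the winning lowercase form,
    # pick the exact spelling that occurs most often.
    exact_counts = Counter(s for s in spellings if s.lower() == top_folded)
    return exact_counts.most_common(1)[0][0]


# Values taken from the brand column in the example above.
brands = [
    "ABC, LLC",
    "ABC, LLC",
    "ABc, LLC",
    "abc, LLC",
    "ABC, LLC Alphabet",
    "ABC, LLC Alphabet",
]

suggestion = suggest_canonical(brands)
print(suggestion)
```

The suggestion could then be shown to the user during training, who either accepts it or types a replacement.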
I'm commenting even though this issue is closed because it's still in the tasks backlog, and PR #34 mentions that Gazetteer helps with this issue but doesn't say it solves it completely.
@alexis-evelyn Using the canonical=True flag should provide the results you describe above. Feel free to repost if I misread your comment!