jbesomi/texthero

add fix_encoding to preprocessing

cedricconol opened this issue · 3 comments

I think it would be nice to have a fix_encoding function in preprocessing to fix bad encoding in input text. We can build this using ftfy.

Examples from ftfy's readme:

>>> print(fix_text('This text should be in â€œquotesâ€\x9d.'))
This text should be in "quotes".

>>> print(fix_text('uÌˆnicode'))
ünicode

>>> print(fix_text('Broken text&hellip; it&#x2019;s flubberific!',
...                normalization='NFKC'))
Broken text... it's flubberific!
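
A minimal sketch of what this could look like in texthero, assuming ftfy as the backend (the name fix_encoding comes from this proposal; the Series-in, Series-out signature mirroring the other preprocessing functions is an assumption):

import pandas as pd
import ftfy

def fix_encoding(s: pd.Series) -> pd.Series:
    # Repair mojibake, stray HTML entities, and similar
    # encoding damage in every document via ftfy.fix_text.
    return s.apply(ftfy.fix_text)

Usage would then be:

>>> s = pd.Series(['This text should be in â€œquotesâ€\x9d.'])
>>> fix_encoding(s)
0    This text should be in "quotes".
dtype: object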

This is certainly useful; I'm just not sure how common these errors are. I assume we would not put it into the standard clean pipeline, as the problem is probably not very common and running ftfy over every document introduces significant overhead.

Then the only case where this would be used is when a user notices this kind of encoding error in their Series. Wouldn't they just google the problem, land on StackOverflow, import ftfy, and fix it themselves? I'm just not really seeing when a user would look for a texthero function to do this.

The only exception I can see is that these errors might be much more common than I think, but I'm not sure they are.

Agree with @henrifroese.

The way we would implement this is simply by calling s.apply(fix_text), which the user can already do directly.
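
For example, assuming s is a pandas Series of strings:

>>> import ftfy
>>> s = s.apply(ftfy.fix_text)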

@cedricconol if you believe this function might be useful to many users, you could write a blog article on the subject. The idea would be to load a dataset, explain the problem, and show the code that fixes it.

I'm closing this now as the idea is to prioritize: #85

Thanks for your feedback, @henrifroese and @jbesomi.