Is it possible to remove punctuation but exclude cases like "drive-thru"?
Jess0-0 opened this issue · 4 comments
Jess0-0 commented
I'd like to remove punctuation from the text but keep "-" when it is part of a compound word.
For example, "text---cleaning" should become "text cleaning", but "drive-thru" should still be "drive-thru" after cleaning.
jfilter commented
Right now, this is not possible. But this seems to me a feature this package should provide. I will look into it but this may take a while.
jfilter commented
You are mainly interested in keeping hyphens in compound words, right? So other punctuation such as "." or "," should get removed.
Jess0-0 commented
Yes that's correct. Other punctuation such as "." or "," should get removed.
tanwirahmad commented
I had the same kind of scenario. I solved it like this.
from cleantext import clean

def clean_with_exceptions(text, *args, **kwargs):
    # Strings to protect from cleaning; not passed on to clean().
    exceptions = kwargs.pop("exceptions", [])
    # Swap each exception for a letters-only placeholder ("expzexp", "expzzexp", ...).
    for idx, exp in enumerate(exceptions):
        text = text.replace(exp, "exp{}exp".format("z" * (idx + 1)))
    text = clean(text, *args, **kwargs)
    # Restore the original strings after cleaning.
    for idx, exp in enumerate(exceptions):
        text = text.replace("exp{}exp".format("z" * (idx + 1)), exp)
    return text

text = "Stop at the drive-thru, please."  # example input
cleaned_text = clean_with_exceptions(
    text,
    exceptions=["-"],
    no_line_breaks=True,
    no_urls=True,  # replace all URLs with a special token
    no_emails=True,  # replace all email addresses with a special token
    no_currency_symbols=True,  # replace all currency symbols with a special token
    no_punct=True,
)
It is a bit hackish, but it worked for my case.
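Note that with exceptions=["-"] the workaround above appears to protect every hyphen, so "text---cleaning" would keep its "---" rather than become "text cleaning". Below is a minimal standalone sketch of the behaviour asked for in the original question, using only the standard library. It is not part of clean-text; the function name strip_punct_keep_hyphens and the "\x00" placeholder are my own assumptions, and it only handles ASCII punctuation.

```python
import re
import string

# All ASCII punctuation, including "-"; intra-word hyphens are protected first.
_PUNCT_RE = re.compile("[" + re.escape(string.punctuation) + "]")
# A single hyphen with a word character directly on both sides, e.g. "drive-thru".
_INTRAWORD_HYPHEN_RE = re.compile(r"(?<=\w)-(?=\w)")
# Placeholder character; assumed never to occur in the input text.
_PLACEHOLDER = "\x00"


def strip_punct_keep_hyphens(text):
    """Replace punctuation with spaces but keep hyphens inside compound words."""
    # 1. Protect hyphens that sit between two word characters.
    protected = _INTRAWORD_HYPHEN_RE.sub(_PLACEHOLDER, text)
    # 2. Turn all remaining punctuation into spaces.
    no_punct = _PUNCT_RE.sub(" ", protected)
    # 3. Collapse runs of whitespace and trim the ends.
    collapsed = re.sub(r"\s+", " ", no_punct).strip()
    # 4. Put the protected hyphens back.
    return collapsed.replace(_PLACEHOLDER, "-")


print(strip_punct_keep_hyphens("text---cleaning"))
print(strip_punct_keep_hyphens("Stop at the drive-thru, please."))
```

A run such as "---" is removed because none of its hyphens has a word character on both sides, while the single hyphen in "drive-thru" does. Since \w also matches digits and underscores, ranges like "1-2" keep their hyphen too; tighten the lookarounds if that is not wanted.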