jfilter/clean-text

Is it possible to remove punctuation but exclude cases like "drive-thru"?

Jess0-0 opened this issue · 4 comments

I'd like to remove punctuation from the text but keep "-" inside words.
For example, "text---cleaning" should become "text cleaning", but "drive-thru" should still be "drive-thru" after cleaning.

Right now, this is not possible. But it seems like a feature this package should provide. I will look into it, but this may take a while.

You are mainly interested in keeping hyphens in compound words, right? So other punctuation such as "." or "," should get removed.

Yes, that's correct. Other punctuation such as "." or "," should get removed.

I had the same kind of scenario. I solved it like this:

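from cleantext import clean


def clean_with_exceptions(text, *args, exceptions=(), **kwargs):
    """Run clean() while protecting the substrings listed in `exceptions`."""
    # One placeholder per exception: "expzexp", "expzzexp", ...
    # They are plain lowercase letters, so clean() leaves them alone, but they
    # will collide if the input already contains the same letter sequence.
    placeholders = ["exp{}exp".format("z" * (idx + 1)) for idx in range(len(exceptions))]
    # Swap each protected substring for its placeholder before cleaning.
    for exp, placeholder in zip(exceptions, placeholders):
        text = text.replace(exp, placeholder)
    text = clean(text, *args, **kwargs)
    # Restore the protected substrings afterwards.
    for exp, placeholder in zip(exceptions, placeholders):
        text = text.replace(placeholder, exp)
    return text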

text = "Drive-thru, text---cleaning."  # sample input, for illustration only

cleaned_text = clean_with_exceptions(
    text,
    exceptions=["-"],
    no_line_breaks=True,
    no_urls=True,  # replace all URLs with a special token
    no_emails=True,  # replace all email addresses with a special token
    no_currency_symbols=True,  # replace all currency symbols with a special token
    no_punct=True,
)

It is a bit hackish, but it worked for my case.
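
One caveat with the placeholder workaround above: as far as I can tell it protects every "-", so "text---cleaning" would stay "text---cleaning" instead of becoming "text cleaning" as the original request asks. A minimal sketch of a regex-based alternative follows. It uses only the Python standard library, is not part of clean-text, and the function name `remove_punct_keep_compound_hyphens` is made up for illustration. It assumes that "compound word" means a single hyphen with a word character on both sides, and it only handles ASCII punctuation.

```python
import re
import string

# Every ASCII punctuation character except the hyphen.
_PUNCT_NO_HYPHEN = re.compile(
    "[" + re.escape(string.punctuation.replace("-", "")) + "]"
)

# Hyphens that are NOT a single hyphen between two word characters:
# runs of two or more, or a hyphen with no word character on one side.
_STRAY_HYPHENS = re.compile(r"-{2,}|(?<!\w)-|-(?!\w)")


def remove_punct_keep_compound_hyphens(text):
    """Strip punctuation but keep hyphens inside compounds like "drive-thru"."""
    # Remove the other punctuation first, so that "well-," leaves a
    # trailing hyphen that the next step can recognise as stray.
    text = _PUNCT_NO_HYPHEN.sub(" ", text)
    text = _STRAY_HYPHENS.sub(" ", text)
    # Collapse the whitespace introduced by the substitutions.
    return " ".join(text.split())


print(remove_punct_keep_compound_hyphens("text---cleaning"))  # text cleaning
print(remove_punct_keep_compound_hyphens("drive-thru"))       # drive-thru
```

This could be run on the output of clean(..., no_punct=False) so that the library still handles URLs, emails and the rest, with punctuation handled separately afterwards.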