A library for preparing text for machine learning and NLP tasks. It started from the text preprocessing used for the CLIP model, with a few more rules added.
pip install -U ternaus_cleantext
Cleans text similarly to the CLIP model's preprocessing, but more strictly:
- Escapes HTML characters
- Removes HTML tags
- Removes URLs
- Removes extra whitespace
- Converts text to lower case
from ternaus_cleantext.ternaus_cleantext import clean_text
print(clean_text("This is a test https://ternaus.com <b>bold</b>"))
prints
this is a test bold
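For intuition, the rules listed above can be approximated with the Python standard library alone. This is only a rough sketch, not the library's actual implementation: the name `clean_text_sketch`, the regular expressions, and the order of the steps are all assumptions made for illustration. It also reads "escapes HTML characters" as decoding entities such as `&amp;`, which may differ from what `clean_text` does.

```python
import html
import re

# Illustrative patterns only; the real library may use different rules.
_TAG_RE = re.compile(r"<[^>]+>")               # anything that looks like <tag>
_URL_RE = re.compile(r"https?://\S+|www\.\S+")  # http(s) links and bare www. links
_WS_RE = re.compile(r"\s+")                    # runs of whitespace


def clean_text_sketch(text: str) -> str:
    """Hypothetical approximation of the cleaning rules listed above."""
    text = html.unescape(text)            # decode entities such as &amp; (assumed meaning)
    text = _TAG_RE.sub(" ", text)         # drop HTML tags, keep their inner text
    text = _URL_RE.sub(" ", text)         # drop URLs
    text = _WS_RE.sub(" ", text).strip()  # collapse extra whitespace
    return text.lower()                   # lower-case the result


print(clean_text_sketch("This is a test https://ternaus.com <b>bold</b>"))
```

The tag pattern is deliberately naive and would also strip text such as `< b and c >`, so prefer `clean_text` from the package for real data.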