This is a paradigm for corpus preprocessing, especially for raw text extracted directly from the Internet.
This code implementation is based on the blog post Text Preprocessing for NLP (Natural Language Processing), Beginners to Master by Ujjawal Verma.
The main differences from the preprocessing described in the blog are:
- This paradigm relies on spaCy instead of NLTK.
- We remove stopwords twice, once after pre-tokenization and once after the pipeline processing, in order to reduce the workload and running time of the large spaCy pre-trained model. (Please note that running a transformer-based model over a huge corpus can result in a very long running time.)
- After the pipeline preprocessing, we strip all whitespace-only elements from the list of tokens/lemmas.
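The steps above can be sketched as follows. This is a minimal illustration, not the original implementation: the helper names (`pre_tokenize`, `remove_stopwords`, `preprocess`) are hypothetical, the regex pre-tokenizer is an assumption, and `spacy.blank("en")` stands in for the large pre-trained model so the snippet runs without a model download.

```python
import re

import spacy
from spacy.lang.en.stop_words import STOP_WORDS


def pre_tokenize(text):
    """Cheap pre-tokenization (assumption: lowercase + regex word split)
    applied before the expensive spaCy pipeline runs."""
    return re.findall(r"[a-z0-9']+", text.lower())


def remove_stopwords(tokens):
    """First stopword pass, on the raw pre-tokens, to shrink the input
    the spaCy pipeline has to process."""
    return [t for t in tokens if t not in STOP_WORDS]


def preprocess(text, nlp):
    # 1) Pre-tokenize and drop stopwords early to reduce spaCy's workload.
    reduced = " ".join(remove_stopwords(pre_tokenize(text)))
    # 2) Run the spaCy pipeline; with a full model such as en_core_web_lg
    #    you would take tok.lemma_ here instead of tok.text.
    doc = nlp(reduced)
    # Second stopword pass, after pipeline processing.
    tokens = [tok.text for tok in doc if not tok.is_stop]
    # 3) Strip all whitespace-only elements from the token list.
    return [t for t in tokens if t.strip()]


# Stand-in pipeline; the paradigm itself uses a large pre-trained model.
nlp = spacy.blank("en")
print(preprocess("The cats were   sitting on the mat!", nlp))
```

With a real model loaded via `spacy.load("en_core_web_lg")`, the same `preprocess` function applies; only the lemma extraction in step 2 changes.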