This is a paradigm for corpus preprocessing, especially for raw text extracted directly from the Internet.
This code implementation is based on the blog post Text Preprocessing for NLP (Natural Language Processing), Beginners to Master by Ujjawal Verma.
The main differences from the preprocessing described in the blog are:
- This paradigm relies on spaCy instead of NLTK.
- We remove stopwords twice, once after pre-tokenization and once after the pipeline processing, in order to reduce the workload and running time of the large spaCy pre-trained model. (Please note that running a transformer-based model over a huge corpus can result in a very long running time.)
- After the pipeline preprocessing, we strip all whitespace-only elements from the list of tokens/lemmas.
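The steps above can be sketched as follows. This is a minimal illustration, not the original implementation: the helper names (`pre_tokenize`, `remove_stopwords`, `preprocess`) are hypothetical, the regex pre-tokenizer is an assumption, and `spacy.blank("en")` stands in for the large pre-trained model so the snippet runs without a model download.

```python
import re

import spacy
from spacy.lang.en.stop_words import STOP_WORDS


def pre_tokenize(text):
    """Cheap pre-tokenization (assumption: lowercase + regex word split)
    applied before the expensive spaCy pipeline runs."""
    return re.findall(r"[a-z0-9']+", text.lower())


def remove_stopwords(tokens):
    """First stopword pass, on the raw pre-tokens, to shrink the input
    the spaCy pipeline has to process."""
    return [t for t in tokens if t not in STOP_WORDS]


def preprocess(text, nlp):
    # 1) Pre-tokenize and drop stopwords early to reduce spaCy's workload.
    reduced = " ".join(remove_stopwords(pre_tokenize(text)))
    # 2) Run the spaCy pipeline; with a full model such as en_core_web_lg
    #    you would take tok.lemma_ here instead of tok.text.
    doc = nlp(reduced)
    # Second stopword pass, after pipeline processing.
    tokens = [tok.text for tok in doc if not tok.is_stop]
    # 3) Strip all whitespace-only elements from the token list.
    return [t for t in tokens if t.strip()]


# Stand-in pipeline; the paradigm itself uses a large pre-trained model.
nlp = spacy.blank("en")
print(preprocess("The cats were   sitting on the mat!", nlp))
```

With a real model loaded via `spacy.load("en_core_web_lg")`, the same `preprocess` function applies; only the lemma extraction in step 2 changes.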