The project uses SougouCS as a source of documents for two purposes: as training data and as a source of data to be annotated.
The SougouCS data is available from the SougouCS database download.
The SougouCS extractor tool generates plain text from a SougouCS database.
extractor.py is a Python script that extracts and cleans text from a SougouCS database.
Usage:
extractor.py [options]
Options:
-i : input file directory
-o : output file directory
--help : display this help and exit
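
The options above could be parsed along the following lines. This is only a minimal sketch, assuming the script uses Python's standard argparse module; the function name build_parser, the attribute names input_dir and output_dir, the choice to make both options required, and the example directory names are all hypothetical and not taken from extractor.py itself.

```python
import argparse


def build_parser():
    # Hypothetical parser mirroring the documented options; the real
    # extractor.py may handle its arguments differently.
    parser = argparse.ArgumentParser(
        prog="extractor.py",
        description="Extract and clean text from a SougouCS database.",
    )
    # Assumption: both directories are mandatory.
    parser.add_argument("-i", dest="input_dir", required=True,
                        help="input file dir")
    parser.add_argument("-o", dest="output_dir", required=True,
                        help="out file dir")
    # argparse registers -h/--help automatically.
    return parser


# Parse an explicit argument list instead of sys.argv for illustration.
# The directory names are placeholders.
args = build_parser().parse_args(["-i", "sougou_raw", "-o", "sougou_text"])
print(args.input_dir, args.output_dir)
```

Under the same assumptions, the equivalent shell invocation would be "extractor.py -i sougou_raw -o sougou_text".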