A Python script that extracts and cleans text from a SogouCS database.

Primary language: Python. License: GNU General Public License v3.0 (GPL-3.0).

SogouCS-Extractor

Introduction

The project uses SogouCS as a source of documents for two purposes: as training data and as a source of data to be annotated.

SogouCS is available from the SogouCS database download page.

The SogouCS extractor tool generates plain text from a SogouCS database.

Description

extractor.py is a Python script that extracts and cleans text from a SogouCS database.
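To illustrate what "extracts and cleans" could mean in practice, here is a minimal sketch. It is not the code in extractor.py. It assumes that each record in the database stores its body text inside a `<content>...</content>` element, which this README does not confirm, and the function name `extract_texts` is hypothetical.

```python
import re
from typing import List

# ASSUMPTION: the body text of each document sits inside <content>...</content>.
# This layout is not stated in the README; adjust the pattern to the real data.
CONTENT_RE = re.compile(r"<content>(.*?)</content>", re.DOTALL)


def extract_texts(raw: str) -> List[str]:
    """Return the cleaned body text of every <content> element found in raw."""
    texts = []
    for body in CONTENT_RE.findall(raw):
        # Collapse any run of whitespace (including full-width spaces and
        # newlines) into a single space, then trim both ends.
        cleaned = re.sub(r"\s+", " ", body).strip()
        if cleaned:  # skip documents whose body is empty
            texts.append(cleaned)
    return texts
```

The function works on an already decoded string, so the caller has to choose the file encoding when reading the input.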

Usage:

 extractor.py [options]

Options:

 -i          : directory containing the input files
 -o          : directory for the output files
 --help      : display this help and exit
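The options above could be parsed along the following lines. This is a sketch using the standard `argparse` module, not the script's actual code: the attribute names `input_dir` and `output_dir` and the choice to make both options required are assumptions. `argparse` adds `--help` on its own.

```python
import argparse
from pathlib import Path


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for the -i and -o options listed above."""
    parser = argparse.ArgumentParser(
        prog="extractor.py",
        description="Extract and clean text from a SogouCS database.",
    )
    # ASSUMPTION: both options are mandatory; the README does not say so.
    parser.add_argument("-i", dest="input_dir", type=Path, required=True,
                        help="directory containing the input files")
    parser.add_argument("-o", dest="output_dir", type=Path, required=True,
                        help="directory for the output files")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(f"reading from {args.input_dir}, writing to {args.output_dir}")
```

With a parser like this, an invocation would look like `extractor.py -i corpus -o text`, where `corpus` and `text` are placeholder directory names.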