
text-preprocess

Python package for natural language pre-processing with nltk and Hunspell.

Includes:

  1. Standardizing cases
  2. Standardizing symbols
  3. Removing extra whitespace
  4. Removing stopwords
  5. Correcting simple spelling errors
  6. Lemmatizing

Available utilities (a short usage sketch follows this list):

  • clean_cases
  • split_camel_cased
  • clean_invalid_symbols
  • clean_repeated_symbols
  • clean_spaces
  • remove_stopwords
  • fix_spelling
  • SpellChecker
  • lemmatize
  • clean
  • soft_clean
  • full_clean
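
Individually, the utilities can be chained by hand. As a rough sketch (the textpreprocess.cleaners.en module path and exact signatures here are assumptions; only the compound_cleaners path shown in the sample usage below is confirmed):

from textpreprocess.cleaners.en import clean_cases, clean_spaces, remove_stopwords  # assumed path

text = '  This IS an   Example  '
text = clean_cases(text)       # standardize casing, e.g. lowercase
text = clean_spaces(text)      # collapse repeated whitespace
text = remove_stopwords(text)  # drop common English stopwords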

Supported languages:

  1. Spanish
  2. English

Submodules

Spell-checking functions rely on Hunspell dictionary files, placed by default in the dictionaries directory. This collection of dictionaries is included as a git submodule for convenience.

Lemmatization in Spanish relies on lemma dictionary files, placed by default in the lemmas directory. This collection is also included as a git submodule for convenience. Feel free to propose your own!

To clone all submodules, use the following commands:

git submodule init
git submodule update
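
The two steps can also be combined into a single command:

git submodule update --init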

Further reference on git submodules can be found in the Git documentation.

Setup

The stopwords and wordnet corpora for the nltk package must be installed. A helper script is provided for easy setup. Simply run:

python setup.py
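
For reference, the script's effect can be reproduced manually with nltk's built-in downloader (a minimal sketch; the helper script may perform additional setup):

import nltk

nltk.download('stopwords')  # corpus used for stopword removal
nltk.download('wordnet')    # corpus used for lemmatization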

Sample usage

from textpreprocess.compound_cleaners.en import full_clean, soft_clean

text = '   thiss is a bery :''{ñdirti text!  '

full_clean(text) # -> 'this very dirt text'

soft_clean(text) # -> 'this is a very dirty text'
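
Spanish cleaners should work analogously. Assuming the module layout mirrors the English one (an assumption; the es path is not shown in this README):

from textpreprocess.compound_cleaners.es import full_clean  # assumed path, mirroring the en module

full_clean('   estte es un testo  ssucio!  ')  # returns the cleaned Spanish text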

Special thanks to Vicente Oyanedel M. for his work on the first version of this package.