NLP Text Tagger

Natural Language Processing text tagger using nltk and Python 3.

Structure

This application contains three main Python source files.

structures.py contains the data structures used to store, process and classify the text.

training.py is the training facility for the model. It implements a Trainer class that must be fed with texts, and it provides a command-line interface to do that. See the Training section below.

model.py provides facilities for text tagging based on the trained model. It defines a Model class that can classify a text, given a set of Tags.

Output format

The model is saved as a collection of files, each representing a Python object dumped by the pickle module.

The default directory for model save files is out.
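As an illustration of this layout, here is a minimal sketch of dumping several objects to one pickle file each and reading them back. The function names save_objects and load_objects and the .pickle extension are assumptions made for this example; they are not taken from the project sources.

```python
import pickle
from pathlib import Path


def save_objects(objects, out_dir="out"):
    """Dump each object to its own pickle file inside out_dir.

    `objects` maps a file stem (hypothetical naming) to a Python object.
    """
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)
    for name, obj in objects.items():
        with open(out_path / f"{name}.pickle", "wb") as fh:
            pickle.dump(obj, fh)


def load_objects(out_dir="out"):
    """Load every .pickle file found in out_dir, keyed by file stem."""
    loaded = {}
    for path in sorted(Path(out_dir).glob("*.pickle")):
        with open(path, "rb") as fh:
            loaded[path.stem] = pickle.load(fh)
    return loaded
```

Pickle files can execute arbitrary code when loaded, so only load model dumps that come from a trusted source.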

Training

Training can be performed either using the provided shell script or by using the interactive Jupyter notebook file.

Using shell

Change to the root directory of this project and run python3 training.py with these arguments:

-d or --directory is the path of the training set, divided into subdirectories. The name of each subdirectory is used as the name of the corresponding Tag.

-o or --output changes the path of the output directory (i.e. where pickle will save the model dump files).

-l or --language defaults to 'en' (English) but can be changed to match the language of your corpus. N.B. Mixing different languages in the same corpus will produce garbage results.
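The arguments above could be declared with argparse roughly as follows. This is a sketch, not the actual contents of training.py: the function names build_parser and list_tags are hypothetical, and treating -d as required is an assumption.

```python
import argparse
from pathlib import Path


def build_parser():
    """Build a parser with the three options described in this README."""
    parser = argparse.ArgumentParser(description="Train the text tagger.")
    # Assumption: the training set path is mandatory.
    parser.add_argument("-d", "--directory", required=True,
                        help="training set, one subdirectory per Tag")
    parser.add_argument("-o", "--output", default="out",
                        help="directory for the pickle dump files")
    parser.add_argument("-l", "--language", default="en",
                        help="language of the corpus")
    return parser


def list_tags(directory):
    """Return the subdirectory names of `directory`, used as Tag names."""
    return sorted(p.name for p in Path(directory).iterdir() if p.is_dir())
```

With a training set laid out as corpus/sport and corpus/politics, list_tags("corpus") would return the two Tag names, and build_parser().parse_args(["-d", "corpus"]) would leave the output directory and language at their defaults.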

Using Jupyter notebook

You can also train the model from a Jupyter notebook file. The folder paths can be specified in the notebook itself. The notebook requires more packages than the shell version: seaborn, pandas, matplotlib, numpy and jupyter.

Further developments

Training could be sped up considerably by using threads or by offloading the workload to a remote server. Libraries such as Dask, RPyC or Pyro can be used for this, but they require a scheduler and a work division policy.
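As a rough idea of what a simple work division policy might look like with the standard library alone, the sketch below splits the corpus per document and merges the partial results. The count_tokens function is a stand-in for the real per-document training work, which is not shown in this README, and both function names are hypothetical.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def count_tokens(text):
    """Stand-in for per-document work: count whitespace-separated tokens."""
    return Counter(text.lower().split())


def count_tokens_parallel(texts, max_workers=4):
    """Process each text in a worker thread and merge the partial counts."""
    totals = Counter()
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the same order as the input texts.
        for partial in pool.map(count_tokens, texts):
            totals.update(partial)
    return totals
```

In CPython, threads mainly help with I/O-bound work such as reading files. For CPU-bound processing, ProcessPoolExecutor offers the same interface and may be a better fit.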