NLP Text Tagger
Natural Language Processing text tagger using nltk and Python 3.
Structure
This application contains three main Python source files:
structures.py contains the data structures used to store, process and classify the text.
training.py is the training facility for the model. It implements a Trainer class that must be fed with texts, and it provides a command-line entry point to do that. See more about it below.
model.py provides facilities for text tagging based on the trained model. It defines a Model class that can classify a text, given a set of Tags.
Output format
The output format chosen for the model is a collection of files,
each representing a Python object dumped by the pickle module.
The default directory for model save files is out
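As a rough illustration of this layout, the sketch below dumps a few named Python objects into a directory, one pickle file per object, and loads them back. The function names, the object names and the .pickle extension are assumptions made for the example; the actual file names produced by training.py may differ.

```python
import pickle
from pathlib import Path


def save_objects(objects, out_dir="out"):
    """Dump each named object to its own pickle file inside out_dir."""
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)
    for name, obj in objects.items():
        # One file per object; the ".pickle" suffix is an assumption.
        with open(out_path / f"{name}.pickle", "wb") as fh:
            pickle.dump(obj, fh)


def load_objects(out_dir="out"):
    """Load every *.pickle file in out_dir into a dict keyed by file stem."""
    loaded = {}
    for path in sorted(Path(out_dir).glob("*.pickle")):
        with open(path, "rb") as fh:
            loaded[path.stem] = pickle.load(fh)
    return loaded
```

Pickle files can execute arbitrary code when loaded, so only load model dumps that come from a source you trust.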
Training
Training can be performed either using the provided shell script or by using the interactive Jupyter notebook file.
Using shell
Browse to the root directory of this project and call python3 training.py
with these arguments:
-d or --directory is the path of the training set, divided into subdirectories whose names are used as the names of the corresponding Tags.
-o or --output can be used to change the path of the output directory (i.e. where pickle will save the model dump files).
-l or --language defaults to 'en' (English) but can be changed to match the language of your corpus.
N.B. Using different languages in the same corpus will produce garbage results.
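The arguments above could be parsed along the following lines. This is only a sketch and not the actual content of training.py: the function names are hypothetical, and treating -d as required is an assumption. It also shows how a training set organized as one subdirectory per Tag can be mapped to tag names.

```python
import argparse
from pathlib import Path


def build_parser():
    """Build a parser mirroring the three documented options."""
    parser = argparse.ArgumentParser(description="Train the text tagger.")
    # Assumption: the training set path is mandatory.
    parser.add_argument("-d", "--directory", required=True,
                        help="training set, one subdirectory per Tag")
    parser.add_argument("-o", "--output", default="out",
                        help="directory where model dump files are saved")
    parser.add_argument("-l", "--language", default="en",
                        help="language of the corpus")
    return parser


def collect_training_files(directory):
    """Map each subdirectory name (the Tag name) to the files it contains."""
    files_by_tag = {}
    for sub in sorted(Path(directory).iterdir()):
        if sub.is_dir():
            files_by_tag[sub.name] = sorted(
                p for p in sub.iterdir() if p.is_file()
            )
    return files_by_tag


def main(argv=None):
    """Parse the arguments and report how many files each Tag has."""
    args = build_parser().parse_args(argv)
    for tag, files in collect_training_files(args.directory).items():
        print(f"{tag}: {len(files)} file(s)")
```

Calling main(["-d", "corpus", "-l", "it"]) would scan a hypothetical corpus directory and keep the default output directory.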
Using Jupyter notebook
You can also train the model using a Jupyter notebook file. The various folder
paths can be specified in the notebook itself.
The notebook requires more packages than the shell version: seaborn, pandas, matplotlib, numpy and jupyter.
Further developments
Training tasks can be sped up considerably by using threads or by offloading the workload to a remote server. This can be done with libraries such as Dask, RPyC or Pyro, but they require a scheduler and a work division policy.
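To make the idea of a scheduler and a work division policy concrete, here is a minimal sketch using only the standard library. It is not part of the project: token counting stands in for the real per-text training work, the function names are hypothetical, and the round-robin split is just one possible policy.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(items, n_chunks):
    """Work division policy: deal items round-robin into n_chunks lists."""
    return [items[i::n_chunks] for i in range(n_chunks)]


def count_tokens(texts):
    """Stand-in for the per-chunk work: count whitespace-separated tokens."""
    counts = Counter()
    for text in texts:
        counts.update(text.lower().split())
    return counts


def parallel_count(texts, workers=4):
    """Process each chunk in a worker thread, then merge the partial results."""
    chunks = split_into_chunks(texts, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # The executor acts as the scheduler: it hands chunks to idle workers.
        partials = list(pool.map(count_tokens, chunks))
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total
```

In CPython, threads mainly help with I/O-bound work such as reading files, because the global interpreter lock prevents pure-Python code from running on several cores at once. For CPU-bound processing, ProcessPoolExecutor from the same module offers the same map interface.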