dctool2: A Python repository from pmatigakis

dctool2 is a collection of luigi tasks that train a document text classifier.

Installation

Create and activate a virtualenv environment

virtualenv --python=python3 virtualenv
source virtualenv/bin/activate

Download dctool2 and install it from the root of the source tree

python setup.py install

Usage

dctool2 requires labeled documents stored in a single file in an HDFS folder. Every line of that file must contain exactly one JSON-encoded object, one object per document. Each object must have the following schema.

{
    "text": "the document content",
    "category": "the document category"
}
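As an illustration of the one-object-per-line format, here is a minimal sketch that serializes documents and checks an existing file's content against the schema above. The function names and the sample documents are hypothetical and not part of dctool2; uploading the resulting file to HDFS is left out.

```python
import json


def to_json_lines(documents):
    """Serialize (text, category) pairs as one JSON object per line."""
    lines = []
    for text, category in documents:
        # json.dumps escapes embedded newlines, so each object stays on one line.
        lines.append(json.dumps({"text": text, "category": category}))
    return "\n".join(lines) + "\n"


def invalid_line_numbers(content):
    """Return the 1-based numbers of lines that do not match the schema."""
    bad = []
    for number, line in enumerate(content.splitlines(), start=1):
        try:
            obj = json.loads(line)
        except ValueError:
            bad.append(number)
            continue
        if not (
            isinstance(obj, dict)
            and isinstance(obj.get("text"), str)
            and isinstance(obj.get("category"), str)
        ):
            bad.append(number)
    return bad


# Hypothetical sample data, for illustration only.
sample = to_json_lines([
    ("the first document\nwith two lines", "news"),
    ("the second document", "sports"),
])
print(sample, end="")
print(invalid_line_numbers(sample))
```

Running a check like this before starting the tasks can catch malformed lines early, since a training run can take a long time.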

Start the luigi scheduler

luigid --pidfile /path/to/pid/file --logdir /path/to/logs --state-path /path/to/state/file

Run the luigi tasks. The CreateClassifier task will perform a grid search to find the parameters that give the best classification result.

The following parameters must be given in the luigi.cfg file

variable	description
documents-file	the hdfs path to the training documents
output-folder	the path to store the results
categories	what categories to use in the classifier
test-size	the test set size
min-df-list	the term minimum document frequency
max-df-list	the term maximum document frequency
percentile-list	the percentile of features to keep
namenode-host	the hadoop namenode address
namenode-port	the hadoop namenode port
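Before starting a run it can help to confirm that none of these options was forgotten. The sketch below is one possible check, not something dctool2 provides: it parses the configuration text with Python's configparser and reports which option names from the table appear in no section. It compares the names exactly as spelled in the table, and the section name and paths in the sample are hypothetical.

```python
import configparser

# Option names taken from the table above.
REQUIRED_OPTIONS = [
    "documents-file",
    "output-folder",
    "categories",
    "test-size",
    "min-df-list",
    "max-df-list",
    "percentile-list",
    "namenode-host",
    "namenode-port",
]


def missing_options(cfg_text):
    """Return the required option names that appear in no section of cfg_text."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(cfg_text)
    found = set()
    for section in parser.sections():
        found.update(parser.options(section))
    return [name for name in REQUIRED_OPTIONS if name not in found]


# Hypothetical, deliberately incomplete configuration.
sample = """
[example-section]
documents-file = /path/to/documents
output-folder = /path/to/output
"""
print(missing_options(sample))
```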

Start the task with the following command

luigi --module dctool2.categories.tasks CreateClassifier --workers 4

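The trained classifier is written to the <output-folder>/trained_classifier/classifier.pickle file. Use scikit-learn's sklearn.externals.joblib module to load it. scikit-learn 0.23 and later no longer ship that module, so use the standalone joblib package with those versions.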
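Loading the classifier could look like the sketch below. It assumes the pickle file is reachable on the local filesystem (copy it from HDFS first if the output folder lives there). The helper names are hypothetical; the import falls back to sklearn.externals.joblib only when the standalone joblib package is not installed.

```python
import posixpath


def classifier_path(output_folder):
    """Build the path of the trained classifier inside the output folder."""
    return posixpath.join(output_folder, "trained_classifier", "classifier.pickle")


def load_classifier(output_folder):
    """Load the trained classifier with joblib."""
    try:
        import joblib  # standalone package
    except ImportError:
        from sklearn.externals import joblib  # older scikit-learn releases
    return joblib.load(classifier_path(output_folder))


# Usage, with a placeholder path:
# classifier = load_classifier("/path/to/output-folder")
# Assumption: the loaded object accepts raw text, e.g.
# classifier.predict(["the document content"])
```

Whether the loaded object accepts raw text or expects pre-computed features depends on how the tasks build it, so check the type of the loaded object before calling predict.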

The classifier evaluation will be stored in the <output-folder>/analysis folder.

Keep in mind that training can take a long time. On a laptop with an i3-3217U CPU and 8 GB of RAM, it took about an hour to train a classifier on a 2000-document dataset with several different parameter values.
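Part of the cost comes from the grid search itself: the number of candidate settings grows multiplicatively with the length of each parameter list. The sketch below only counts the combinations; it assumes a full Cartesian product over min-df-list, max-df-list and percentile-list, which is how grid search usually works but is not confirmed here for dctool2. The function name and the sample values are hypothetical.

```python
from itertools import product


def parameter_grid(min_df_list, max_df_list, percentile_list):
    """Enumerate every combination of the three parameter lists."""
    return [
        {"min_df": min_df, "max_df": max_df, "percentile": percentile}
        for min_df, max_df, percentile in product(
            min_df_list, max_df_list, percentile_list
        )
    ]


# Hypothetical values: 2 * 2 * 3 = 12 candidate settings to train and evaluate.
grid = parameter_grid([1, 5], [0.8, 0.9], [10, 20, 50])
print(len(grid))
```

Shortening any one of the lists reduces the number of candidate settings proportionally, which is the simplest way to reduce training time.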
