Training and using classifiers for textual documents
pyTextClassification is a simple Python library for training and using text classifiers. A classifier is trained on a corpus of text documents organized in folders, each folder corresponding to a different content class.
- pip dependencies:
pip install numpy matplotlib scipy scikit-learn nltk
- [pyAudioAnalysis](https://github.com/tyiannak/pyAudioAnalysis), used for training and evaluating classifiers
In order to train a classifier based on a dataset, the following command must be used:
python textClassification.py trainFromDirs -i <datasetPath> --method <svm or knn or randomforest or gradientboosting or extratrees> --methodname <modelFileName>
<datasetPath> is the path of the training corpus. This path must contain one folder per content class, and each folder contains the files (no extension assumed) of the documents that belong to that class.
<modelFileName> is the path where the trained model is stored.
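Before training, it can help to check that the corpus follows the folder-per-class layout described above. The following is a minimal sketch and not part of pyTextClassification; the function names `list_corpus` and `summarize_corpus` are hypothetical helpers introduced here for illustration.

```python
from pathlib import Path


def list_corpus(dataset_path):
    """Map each class folder name to the sorted list of document paths in it."""
    corpus = {}
    for class_dir in sorted(Path(dataset_path).iterdir()):
        if not class_dir.is_dir():
            continue  # only folders are treated as content classes
        docs = sorted(p for p in class_dir.iterdir() if p.is_file())
        if docs:
            corpus[class_dir.name] = docs
    return corpus


def summarize_corpus(dataset_path):
    """Return {class_name: number_of_documents} as a quick sanity check."""
    return {name: len(docs) for name, docs in list_corpus(dataset_path).items()}
```

For example, `summarize_corpus("moviePlotsSmall/")` would report how many documents each class folder holds, which makes empty or misplaced folders easy to spot.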
Feature extraction is done using a set of predefined (static) dictionaries, stored in the myDicts/ folder. For each dictionary, a separate feature value is extracted.
Example:
python textClassification.py trainFromDirs -i moviePlotsSmall/ --method svm --methodname svmMoviesPlot7Classes
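To illustrate the idea of one feature value per dictionary, here is a minimal sketch. It is not the library's implementation: it assumes each dictionary is a plain set of words and that the feature is the share of document tokens found in that dictionary. The actual dictionary format in myDicts/ and the exact feature definition may differ. The names `tokenize`, `dictionary_features` and `toy_dicts` are hypothetical.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def dictionary_features(text, dictionaries):
    """Return one value per dictionary: the share of tokens found in it.

    `dictionaries` maps a dictionary name to a set of lowercase words
    (an assumed format, chosen only for this illustration).
    """
    tokens = tokenize(text)
    if not tokens:
        return [0.0] * len(dictionaries)
    return [
        sum(token in words for token in tokens) / len(tokens)
        for words in dictionaries.values()
    ]


# Toy dictionaries, made up for this example.
toy_dicts = {
    "crime": {"gun", "heist", "police"},
    "romance": {"love", "kiss", "wedding"},
}
features = dictionary_features(
    "A heist goes wrong and the police close in.", toy_dicts
)
```

The resulting vector has as many entries as there are dictionaries, so every document is mapped to a fixed-length feature vector regardless of its length.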
Given a trained model, and an unknown document, the following command syntax is used to classify the document:
python textClassification.py classifyFile -i <pathToUnknownDocument> --methodname <modelFileName>
This repository already contains a trained SVM model (svmMoviesPlot7Classes) that discriminates between 7 classes of movie plots. The files samples/sample_pulpFiction, samples/sample_forestgump and samples/sample_lordoftherings contain three plot examples that can be used as unknown documents for testing.
In order to classify these three files using svmMoviesPlot7Classes, the following commands must be executed:
python textClassification.py classifyFile -i samples/sample_pulpFiction --methodname svmMoviesPlot7Classes
python textClassification.py classifyFile -i samples/sample_forestgump --methodname svmMoviesPlot7Classes
python textClassification.py classifyFile -i samples/sample_lordoftherings --methodname svmMoviesPlot7Classes
Each of the above commands returns the most dominant content classes, along with their normalized probabilities, sorted from highest to lowest.
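The ranking step can be sketched as follows. This is an illustration, not the library's code: it assumes the classifier yields one non-negative score per class, normalizes the scores so they sum to 1, and sorts them in descending order. The function `rank_classes` and the class names and scores below are made up for the example.

```python
def rank_classes(scores, top_n=3):
    """Normalize non-negative class scores to sum to 1 and sort descending."""
    total = sum(scores.values())
    if total <= 0:
        raise ValueError("scores must contain at least one positive value")
    normalized = {name: value / total for name, value in scores.items()}
    ranked = sorted(normalized.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top_n]


# Hypothetical class names and raw scores, for illustration only.
example_scores = {"Drama": 2.0, "Crime": 6.0, "Fantasy": 1.0, "Comedy": 1.0}
top = rank_classes(example_scores, top_n=2)
```

Limiting the output to the top few classes keeps the result readable when a model distinguishes many classes.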