
nlp-cats

An OpenNLP classifier in Clojure that suggests categories for text snippets. Source code is available on GitHub.

Goals of the project

To create a proof-of-concept tool that categorizes issues based on their title and description. The tool could become part of a larger commercial product for reporting and processing problems in cities and buildings. This is important for UX, because it automatically fills in fields that would otherwise have to be filled in by hand.

Dataset preparation

As an initial dataset, a part of a production database was taken. It looked like a list of triples (category, title, description). The list was then split into two parts, the first for building problems and the second for city problems; the separation was done with SQL queries using join and where clauses (a sketch of such a query follows the table). The number of distinct categories for building problems is smaller, so only building problems were used for training the model. The table below shows all possible categories for building problems.

category_id   category_title   organization_type
1             Temperature      buildings
2             Cleaning         buildings
3             Locksmith        buildings
4             Electrical       buildings
5             Plumbing         buildings
6             Painting         buildings
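
The extraction could look roughly like the query below. This is a minimal sketch using clojure.java.jdbc; the table and column names are hypothetical, since the real production schema is not shown.

(require '[clojure.java.jdbc :as jdbc])

(def db-spec {:dbtype "postgresql" :dbname "issues"}) ; placeholder connection

(defn building-issues
  "Fetches (category, title, description) triples for building problems."
  []
  (jdbc/query db-spec
              ["SELECT c.title AS category, i.title, i.description
                  FROM issues i
                  JOIN categories c ON c.id = i.category_id
                 WHERE c.organization_type = 'buildings'"]))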

The Painting and Cleaning categories were actually removed from the experiment, because there were not enough samples for them. After that the dataset was split into two more parts by language. There were only two languages in the dataset, French and English, and the separation was done with a Clojure language detection library \cite{clojurelangdetect}, which is just a wrapper over a language detection library for Java \cite{nakatani2010langdetect}. Only the French part was used for training the model in this example.
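
The split by language could be done roughly as below. This is a minimal sketch that calls the underlying Java langdetect library through interop rather than the wrapper, and the profiles path and record keys are assumptions.

(import '[com.cybozu.labs.langdetect DetectorFactory])

;; load the language profiles shipped with langdetect (path is a placeholder)
(DetectorFactory/loadProfile "resources/langdetect-profiles")

(defn detect-language [text]
  (let [detector (DetectorFactory/create)]
    (.append detector text)
    (.detect detector)))            ; returns a language code such as "fr"

(defn french-only [records]
  (filter #(= "fr" (detect-language (str (:title %) " " (:description %))))
          records))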

Another important step was anonymization: it was done with regexp matching for things like badge ids, and with part-of-speech tagging to find real person names and replace them with anonymized ones \cite{crf-pos-tagger}. Finally, duplicates and test records were removed from the dataset.
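
A minimal sketch of this step is shown below; the badge-id pattern and the placeholder tokens are hypothetical, and finding the person names themselves is left to the POS tagging pass.

(require '[clojure.string :as str])

(defn anonymize
  "Replaces badge ids and the given person names with neutral placeholders."
  [text names]
  (reduce (fn [t n] (str/replace t n "PERSON"))
          ;; hypothetical badge-id format: a capital letter and 5-8 digits
          (str/replace text #"\b[A-Z]\d{5,8}\b" "BADGE_ID")
          names))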

After all these manipulations around 45 anonymized French records remained. They were split into a training dataset and a test dataset, train.csv and test.csv, although later another validation method was chosen, as explained in the evaluation chapter.

Implementation

The implementation is based on a Clojure wrapper around the Apache OpenNLP library \cite{apache-opennlp} and consists of a few functions. remove-indexed is used for splitting the dataset into training and test parts. count-matches and count-semi-matches are reducers that help calculate the model score. The first two are sketched below; count-semi-matches is shown in the evaluation chapter.
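
These are minimal reconstructions from the description above, not the project's actual definitions.

(defn remove-indexed
  "Returns coll without the element at index i."
  [i coll]
  (concat (take i coll) (drop (inc i) coll)))

(defn count-matches
  "Reducer: counts results whose top suggestion equals the real category."
  [acc [match? _ _ _]]
  (if match? (inc acc) acc))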

train-and-test-model is the most important part of the implementation. In the first step it trains a model from train-dataset, after which it creates a categorizer: a function that returns the probability of each category for a particular text snippet. The model can be saved with the train/write-model function for future use. In the second step the function loops over the test dataset and creates, for each entry, a tuple of four elements: a boolean that tells whether the suggested category equals the real category, the real category, the text snippet, and the probabilities of all categories. Examples of such tuples are given below, followed by a sketch of the function:

[false
 "Locksmith"
 "Monte Charge le bouton 1er étage du monte charge pour aller au local poubelle
 ne fonctionne plus Merci."
 (["Electrical" 0.2305854563819988]
  ["Locksmith" 0.21372902079373204]
  ["Plumbing" 0.17350935855753205]
  ["Temperature" 0.15122224738999251])]
[true
 "Plumbing"
 "Fuite d'eau au niveau du toit Fuite d'eau au niveau du toit suite à la pluie
 de cet après-midi.."
 (["Plumbing" 0.5268062661863027]
  ["Temperature" 0.13269958836754342]
  ["Locksmith" 0.10136806966414517]
  ["Electrical" 0.06574031705840017])]

do-cross-validation is a function that splits the dataset into two parts, train-data and test-data, and calculates the score for the model. For splitting it uses the nth and remove-indexed functions to implement leave-one-out cross-validation. After splitting it writes the two files and runs the train-and-test-model function to produce a pair of ints: the number of correctly suggested categories and the number of issues in the test set. For leave-one-out cross-validation this is [0 1] or [1 1]. To calculate the score it uses the function passed as the matcher parameter.
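
A minimal sketch of the leave-one-out loop, assuming a hypothetical write-csv! helper for producing the intermediate files:

(defn do-cross-validation
  "Each record in turn becomes the single-item test set;
  returns [matches total] summed over all runs."
  [dataset matcher]
  (->> (range (count dataset))
       (map (fn [i]
              (let [test-data  [(nth dataset i)]
                    train-data (remove-indexed i dataset)]
                (write-csv! "train.csv" train-data) ; hypothetical csv helper
                (write-csv! "test.csv" test-data)
                (let [results (train-and-test-model "train.csv" test-data)]
                  ;; a single run yields e.g. [0 1] or [1 1]
                  [(reduce matcher 0 results) (count results)]))))
       (reduce (fn [[m t] [m2 t2]] [(+ m m2) (+ t t2)]) [0 0])))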

Evaluation

The first question is why leave-one-out cross-validation was used. The answer is simple: after data preparation the number of issues dropped from 327 to 45, and 45 is a very small dataset. Common validation schemes (a 70/30 split, for example) are not acceptable in this case, because they require too many samples in the test set; the training set then becomes much smaller and the model quality drops significantly. With leave-one-out only one element is excluded from the training set, which gives better results.

reduce function      matches   total   percentage
count-matches        27        45      60%
count-semi-matches   33        45      73.3%

Two functions were used to calculate the score for the model. The first one counts how many items have their most probable category equal to the real category. The result is pretty weak, only 60%, but on the other hand a training dataset of 44 issues is really small, and in that context it does not look so bad.

The other idea was to count how many items fall into the two most probable categories: in the example in the Implementation section one item did not get the right top category, but the second most probable category had a probability quite close to the first. The experiment with count-semi-matches showed that 73.3% of the issues fall into the two most probable categories. That is less than intuition suggested, but it is still useful to know.
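
A minimal reconstruction of count-semi-matches from this description; it assumes the probabilities are sorted in descending order, as in the example tuples above.

(defn count-semi-matches
  "Reducer: counts results whose real category is among the two
  most probable suggestions."
  [acc [_ category _ probabilities]]
  (if (some #{category} (map first (take 2 probabilities)))
    (inc acc)
    acc))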

Conclusion

This project implements a classifier that suggests categories for issue reports. It is written in Clojure and can be used in any production environment that runs on the JVM, and models can easily be saved as .bin files. For now, though, the model shows a pretty weak 60% accuracy and is probably not suitable for real-world use. These results can be explained by the size of the training dataset: 45 issues is a really small number for such a task. With a bigger dataset and a few tweaks to the model features it should be possible to get much better results. When more data becomes available a new experiment will be conducted, and if the results are good this tool will be added to the real project.