classifier: An Erlang repository from amilkr

Overview

Classifier is a bayesian text analyzer and classifier. Its goal is provide a simple way to decide if a given text is considered legit or spam, based on a sample pool of texts you provide to the program.

Its logic is based upon Paul Graham's A Plan For Spam.

Features

You can tweak the parameters.
You can manually flag spams and false positives.
The program learns overtime from new texts and updates the sample pools.

Configuration

you need to define classifier as a rebar dep or have some other way of including it in erlang’s path.

To configure classifier you use an application variable (probably in your app.config):

{classifier, [  
  {update_probabilities_timeout, 300000}, %% milliseconds  
  {default_probability, 0.4},  
  {threshold_probability, 0.9},  
  {max_text_tokens, 5},  
  {minimun_appearances, 5}  
]}

All the config params have a default value, so you can skip some or all of them in your config

Usage

First of all you need to start the app:

application:start(classifier)

The next step is training the classifier. You can train it whenever you want and as many times as you want. You need to train it before the first time you start using your app.
There're three ways to train it:
- Passing a dir
```
classifier:train(Dir)
```
Where Dir is a path to some folder that contains two folders called pos and neg where there're files with texts to be analyzed. You can find an example in priv/test dir.
- Passing a text
```
classifier:train({Tag, text, Text})
```
Where Tag is 'pos' or 'neg' and Text is a string to be putted on the Tag side.
- Passing a text list
```
classifier:train({Tag, text_list, Texts})
```
Where Tag is 'pos' or 'neg' and Texts is a list of strings to be pushed on the Tag side.
Now you can ask the classifier to analyze and classify some text:

1> classifier:classify(Text).
acceptable
2> classifier:classify(AnotherText).
unacceptable

Every time the classifier classify a text it learns about the result pushing the text analyzed on its pool

Future Features

Persist the info
Multiprocess to classify text and to update the state

amilkr/classifier

Overview

Features

Configuration

Usage

Future Features