Documer

A Bayes algorithm implementation in PHP for automatic document classification.

Concept

Every document has key words, e.g. Margaret Thatcher.

Every document has a label, e.g. Politics.

Suppose that the key words in every document all start with an uppercase letter. We store these words in our DB, and every time we need to classify a document against a particular label, we use the Bayes algorithm.

Let's make that clearer:

Training:

First, we tokenize the document and keep only its key words (all words starting with an uppercase letter) in an array. We store that array in our DB along with the document's label.
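The tokenizing step above can be sketched as follows. This is a minimal illustration, not Documer's actual tokenizer: the function name `extractKeywords` and the regular expression are assumptions made for this example.

```php
<?php
// Hypothetical helper (not part of Documer's API): returns every word
// that starts with an uppercase letter, in order of appearance.
function extractKeywords(string $text): array
{
    // (?<!\p{L}) ensures the match starts at the beginning of a word;
    // \p{Lu} is one uppercase letter, \p{L}* is the rest of the word.
    preg_match_all('/(?<!\p{L})\p{Lu}\p{L}*/u', $text, $matches);

    return $matches[0];
}

$keywords = extractKeywords(
    "A big and long text about a political Act and a Famous Person"
);
// $keywords is ['A', 'Act', 'Famous', 'Person']
```

Note that a rule this simple also keeps sentence-initial words such as "A", so a real implementation may want to filter those out.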

Guessing:

This is very simple. Again, we parse the document we want to classify and create an array of its key words. Here is the pseudocode:

for every label in DB
	for every key word in document
		P(label|word) = P(word|label) P(label) / ( P(word|label) P(label) + (1 - P(word|label)) (1 - P(label)) )
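The pseudocode above could look like this in PHP. This is only a sketch under stated assumptions: the function name `posterior` and the in-memory arrays `$labelPriors` and `$wordGivenLabel` are hypothetical stand-ins for values that Documer would derive from its DB, and the probabilities are made up for illustration. How the per-word results are combined into one score per label is not described here, so the sketch stops at the per-word values.

```php
<?php
// Applies the formula from the pseudocode to a single (label, word) pair.
// Both arguments are probabilities in the range [0, 1].
function posterior(float $pWordGivenLabel, float $pLabel): float
{
    $numerator = $pWordGivenLabel * $pLabel;
    $denominator = $numerator + (1 - $pWordGivenLabel) * (1 - $pLabel);

    // Guard against division by zero (e.g. P(word|label) = 1, P(label) = 0).
    return $denominator > 0.0 ? $numerator / $denominator : 0.0;
}

// Hypothetical, made-up values standing in for what is stored in the DB.
$labelPriors = ['politics' => 0.5, 'sports' => 0.5];
$wordGivenLabel = [
    'politics' => ['Thatcher' => 0.8],
    'sports'   => ['Thatcher' => 0.1],
];
$documentKeywords = ['Thatcher'];

$result = [];
foreach ($labelPriors as $label => $prior) {           // for every label in DB
    foreach ($documentKeywords as $word) {             // for every key word
        $pWord = $wordGivenLabel[$label][$word] ?? 0.0; // unseen word: 0
        $result[$label][$word] = posterior($pWord, $prior);
    }
}
```

With these made-up inputs, "Thatcher" scores higher under "politics" than under "sports", which is the behaviour the formula is meant to capture.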

Usage

Documer uses Spot2 to store its knowledge. Spot2 supports MySQL and SQLite.

Install through composer

"require": {
    "kbariotis/documer": "dev-master"
},

Instantiate

Pass a Spot object with your configuration to getInstance.

$cfg = new \Spot\Config();
$cfg->addConnection('mysql', 'mysql://user:password@localhost/documer');
$spot = new \Spot\Locator($cfg);

$documer = Documer\Documer::getInstance($spot);

Train

$documer->train("politics", "A big and long text about a political Act and a Famous Person");

Guess

$scores = $documer->guess("And an other big and long text about a political Act and a Famous Person");

$scores will hold an array with every label in your system and the probability that the document belongs to each label.
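To pick the most likely label from the result, something like the following may work. It assumes that $scores is an associative array mapping each label to its score, which this README does not confirm; the values below are hypothetical placeholders rather than real output from guess.

```php
<?php
// Hypothetical result shape: label => score. Real values come from guess().
$scores = ['sports' => 0.12, 'politics' => 0.83];

// Sort by score, highest first, keeping the label keys.
arsort($scores);

// The first key is now the best-matching label (requires PHP 7.3+).
$best = array_key_first($scores);
```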
