Tesseract OCR for PHP

A wrapper to work with Tesseract OCR inside PHP.

Installation

First, make sure Tesseract OCR (v3.03 or greater) is installed.

As a composer dependency

{
    "require": {
        "thiagoalessio/tesseract_ocr": "1.1.0"
    }
}

Usage

Basic usage

Given the following image (text.png):

And the following code:

<?php
echo (new TesseractOCR('text.png'))
    ->run();

The output would be:

The quick brown fox
jumps over the lazy
dog.

Other languages

Given the following image (german.png):

And the following code:

<?php
echo (new TesseractOCR('german.png'))
    ->run();

The output would be:

griiﬁen

Which is not good, but defining a language:

<?php
echo (new TesseractOCR('german.png'))
    ->lang('deu')
    ->run();

Will produce:

grüßen

Multiple languages

Given the following image (multi-languages.png):

And the following code:

<?php
echo (new TesseractOCR('multi-languages.png'))
    ->lang('eng', 'jpn', 'por')
    ->run();

The output would be:

I eat 寿司 de maçã

Inducing recognition

Given the following image (8055.png):

And the following code, which restricts recognition to uppercase letters:

<?php
echo (new TesseractOCR('8055.png'))
    ->whitelist(range('A', 'Z'))
    ->run();

The output would be:

BOSS

Quiet Mode

To suppress Tesseract's console output, enable Quiet Mode:

<?php
echo (new TesseractOCR('text.png'))
    ->quietMode(true)
    ->run();

This keeps Tesseract's messages out of your console logs.

Debugging

To inspect the generated tesseract command without running it, call buildCommand() instead of run():

<?php
echo (new TesseractOCR('image.png'))
    ->executable('/usr/local/bin/tesseract')
    ->lang('eng', 'jpn', 'por')
    ->psm(8)
    ->quietMode(true)
    ->buildCommand();

Will return:

/usr/local/bin/tesseract 'image.png' stdout -l eng+jpn+por -psm 8 quiet
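The `-l eng+jpn+por` part of the command above suggests that the language codes are joined with `+`. A minimal sketch of that step, assuming nothing about the wrapper's internals (`buildLangFlag` is a hypothetical name used only for illustration):

```php
<?php
// Hypothetical helper, not the wrapper's real code: joins language codes
// into the "-l lang1+lang2" form seen in the generated command.
function buildLangFlag(string ...$langs): string
{
    if ($langs === []) {
        return ''; // no language given: leave the choice to tesseract
    }
    return '-l ' . implode('+', $langs);
}

echo buildLangFlag('eng', 'jpn', 'por'), PHP_EOL; // prints "-l eng+jpn+por"
```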

API

`->executable('/path/to/tesseract')`

Define a custom location for the tesseract executable if, for any reason, it is not present in the $PATH.

`->tessdataDir('/path')`

Specify a custom location for the tessdata directory.

`->userWords('/path/to/user-words.txt')`

Specify the location of the user words file.

This is a plain text file containing a list of words that you want tesseract to treat as normal dictionary words.

Useful when dealing with content that contains technical terminology, jargon, etc.

Example of a user words file:

$ cat /path/to/user-words.txt
foo
bar
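If the word list lives in PHP rather than on disk, it can be written out before calling `->userWords()`. The following is only a sketch: `writeUserWords` is a hypothetical helper, and the one-word-per-line layout is taken from the example file above.

```php
<?php
// Hypothetical helper: writes one word per line to a temporary file
// and returns the path of that file.
function writeUserWords(array $words): string
{
    $path = tempnam(sys_get_temp_dir(), 'user-words-');
    if ($path === false) {
        throw new RuntimeException('Could not create a temporary file.');
    }
    file_put_contents($path, implode("\n", $words) . "\n");
    return $path;
}

$path = writeUserWords(['foo', 'bar']);
echo file_get_contents($path); // prints "foo" and "bar" on separate lines
```

The returned path can then be passed to `->userWords($path)`.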

`->userPatterns('/path/to/user-patterns.txt')`

Specify the location of the user patterns file.

If the content you are dealing with follows known patterns, this option can considerably improve tesseract's recognition accuracy.

Example of a user patterns file:

$ cat /path/to/user-patterns.txt
1-\d\d\d-GOOG-441
www.\n\\\*.com

`->lang('lang1', 'lang2', 'lang3')`

Define one or more languages to be used during the recognition. A complete list of available languages can be found in the Tesseract documentation.

Tip from @daijiale: Use the combination ->lang('chi_sim', 'chi_tra') for proper recognition of Chinese.

`->psm(6)`

Specify the Page Segmentation Mode, which instructs tesseract how to interpret the given image.

Possible psm values are:

 0 = Orientation and script detection (OSD) only.
 1 = Automatic page segmentation with OSD.
 2 = Automatic page segmentation, but no OSD, or OCR.
 3 = Fully automatic page segmentation, but no OSD. (Default)
 4 = Assume a single column of text of variable sizes.
 5 = Assume a single uniform block of vertically aligned text.
 6 = Assume a single uniform block of text.
 7 = Treat the image as a single text line.
 8 = Treat the image as a single word.
 9 = Treat the image as a single word in a circle.
10 = Treat the image as a single character.
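When the psm value comes from user input, it can help to check it against the list above before passing it to `->psm()`. A minimal sketch, where `PSM_MODES` and `describePsm` are hypothetical names and not part of the wrapper:

```php
<?php
// Hypothetical lookup table copied from the list above.
const PSM_MODES = [
    0  => 'Orientation and script detection (OSD) only.',
    1  => 'Automatic page segmentation with OSD.',
    2  => 'Automatic page segmentation, but no OSD, or OCR.',
    3  => 'Fully automatic page segmentation, but no OSD. (Default)',
    4  => 'Assume a single column of text of variable sizes.',
    5  => 'Assume a single uniform block of vertically aligned text.',
    6  => 'Assume a single uniform block of text.',
    7  => 'Treat the image as a single text line.',
    8  => 'Treat the image as a single word.',
    9  => 'Treat the image as a single word in a circle.',
    10 => 'Treat the image as a single character.',
];

// Returns the description of a psm value, or throws if it is not listed.
function describePsm(int $psm): string
{
    if (!array_key_exists($psm, PSM_MODES)) {
        throw new InvalidArgumentException("Unknown psm value: $psm");
    }
    return PSM_MODES[$psm];
}

echo describePsm(8), PHP_EOL; // prints "Treat the image as a single word."
```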

`->config('configvar', 'value')`

Tesseract offers incredible control to the user through its 660 configuration vars.

You can see the complete list by running the following command:

$ tesseract --print-parameters
Tesseract parameters:
... long list with all parameters ...
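On the command line, tesseract accepts configuration variables as `-c VAR=VALUE`. The sketch below shows how name/value pairs could be turned into such arguments; it is an illustration only, `buildConfigFlags` is a hypothetical name, and the wrapper may build its command differently.

```php
<?php
// Hypothetical helper: turns ['name' => 'value'] pairs into a flat list
// of "-c", "name=value" command-line arguments.
function buildConfigFlags(array $config): array
{
    $flags = [];
    foreach ($config as $name => $value) {
        $flags[] = '-c';
        $flags[] = $name . '=' . $value;
    }
    return $flags;
}

echo implode(' ', buildConfigFlags(['tessedit_char_whitelist' => 'abc'])), PHP_EOL;
// prints "-c tessedit_char_whitelist=abc"
```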

`->whitelist(range('a', 'z'), range(0, 9), '-_@')`

This is a shortcut for ->config('tessedit_char_whitelist', 'abcdef....').
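To make the shortcut concrete, here is a sketch of how a mix of ranges and strings could be flattened into the single string that `tessedit_char_whitelist` expects. `flattenWhitelist` is a hypothetical helper written for this illustration, not the wrapper's actual code.

```php
<?php
// Hypothetical helper: concatenates arrays of characters and plain
// strings into one string of allowed characters.
function flattenWhitelist(...$parts): string
{
    $chars = '';
    foreach ($parts as $part) {
        $chars .= is_array($part) ? implode('', $part) : (string) $part;
    }
    return $chars;
}

echo flattenWhitelist(range('a', 'z'), range(0, 9), '-_@'), PHP_EOL;
// prints "abcdefghijklmnopqrstuvwxyz0123456789-_@"
```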

Where to get help

#tesseract-ocr-for-php on freenode IRC

License

Apache License 2.0.
