Tesseract OCR for PHP

A wrapper to work with Tesseract OCR inside PHP.

Installation

First, make sure Tesseract OCR (v3.03 or greater) is installed.
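The version requirement can be checked from PHP before any OCR is attempted. The following is a minimal sketch, not part of the wrapper: `tesseractVersionAtLeast()` is a hypothetical helper, and it assumes the first line of `tesseract --version` looks like `tesseract 3.03`.

```php
<?php
// Hypothetical helper (not provided by the wrapper): pulls the version number
// out of `tesseract --version` output and compares it with the minimum
// version this README asks for.
function tesseractVersionAtLeast(string $versionOutput, string $minimum = '3.03'): bool
{
    // Assumed format: "tesseract 3.03", "tesseract 4.1.1", "tesseract v5.0.0", ...
    if (!preg_match('/tesseract\s+v?(\d+(?:\.\d+)*)/i', $versionOutput, $m)) {
        return false; // Unrecognised output: treat Tesseract as unavailable.
    }
    return version_compare($m[1], $minimum, '>=');
}

// Possible usage. Some releases print the version on stderr, so both
// streams are merged here:
// $output = (string) shell_exec('tesseract --version 2>&1');
// var_dump(tesseractVersionAtLeast($output));
```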

As a Composer dependency

{
    "require": {
        "thiagoalessio/tesseract_ocr": "1.1.0"
    }
}

Usage

Basic usage

Given the following image (text.png):

The quick brown fox jumps over the lazy dog

And the following code:

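<?php
// Assumes a Composer install; adjust the autoloader path to your project.
require_once 'vendor/autoload.php';

echo (new TesseractOCR('text.png'))
    ->run();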

The output would be:

The quick brown fox
jumps over the lazy
dog.

Other languages

Given the following image (german.png):

grüßen - Google Translate said it means "to greet" in German

And the following code:

<?php
echo (new TesseractOCR('german.png'))
    ->run();

The output would be:

griifien

This is wrong, but defining the language:

<?php
echo (new TesseractOCR('german.png'))
    ->lang('deu')
    ->run();

Will produce:

grüßen

Multiple languages

Given the following image (multi-languages.png):

The phrase "I eat apple sushi", with mixed English, Japanese and Portuguese

And the following code:

<?php
echo (new TesseractOCR('multi-languages.png'))
    ->lang('eng', 'jpn', 'por')
    ->run();

The output would be:

I eat 寿司 de maçã

Inducing recognition

Given the following image (8055.png):

Number 8055

And the following code:

<?php
echo (new TesseractOCR('8055.png'))
    ->whitelist(range('A', 'Z'))
    ->run();

The output would be:

BOSS

Quiet Mode

To keep Tesseract's messages out of your console and logs, enable Quiet Mode:

<?php
echo (new TesseractOCR('text.png'))
    ->quietMode(true)
    ->run();

This keeps your logs clean.

Debugging

To inspect the generated tesseract command without running it, call buildCommand() instead of run():

<?php
echo (new TesseractOCR('image.png'))
    ->executable('/usr/local/bin/tesseract')
    ->lang('eng', 'jpn', 'por')
    ->psm(8)
    ->quietMode(true)
    ->buildCommand();

Will return:

/usr/local/bin/tesseract 'image.png' stdout -l eng+jpn+por -psm 8 quiet
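To make the shape of that command easier to read, here is a rough sketch of how such a string could be assembled. `sketchTesseractCommand()` is a hypothetical function written for illustration; it is not the wrapper's actual implementation and covers only the options used above.

```php
<?php
// Illustrative only: mimics the layout of the command shown above.
function sketchTesseractCommand(
    string $executable,
    string $image,
    array $languages = [],
    ?int $psm = null,
    bool $quiet = false
): string {
    // Executable, shell-quoted image path, then "stdout" as the output target.
    $parts = [$executable, escapeshellarg($image), 'stdout'];
    if ($languages !== []) {
        $parts[] = '-l ' . implode('+', $languages); // e.g. eng+jpn+por
    }
    if ($psm !== null) {
        $parts[] = '-psm ' . $psm;
    }
    if ($quiet) {
        $parts[] = 'quiet'; // Trailing config name, as in the output above.
    }
    return implode(' ', $parts);
}

echo sketchTesseractCommand(
    '/usr/local/bin/tesseract',
    'image.png',
    ['eng', 'jpn', 'por'],
    8,
    true
), PHP_EOL;
```

Note how the languages are joined with `+` into a single `-l` argument. The exact quoting of the image path comes from `escapeshellarg()` and therefore depends on the platform.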

API

->executable('/path/to/tesseract')

Define a custom location for the tesseract executable if, for any reason, it is not present in the $PATH.

->tessdataDir('/path')

Specify a custom location for the tessdata directory.

->userWords('/path/to/user-words.txt')

Specify the location of the user words file.

This is a plain text file containing a list of words that tesseract should treat as normal dictionary words.

Useful when dealing with content that contains technical terminology, jargon, etc.

Example of a user words file:

$ cat /path/to/user-words.txt
foo
bar
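If the word list lives in your application rather than on disk, the file can be generated just before recognition. A minimal sketch, assuming one word per line as in the example above; `writeUserWords()` is a hypothetical helper and not part of the wrapper.

```php
<?php
// Hypothetical helper: writes one word per line to a temporary file
// and returns the path of that file.
function writeUserWords(array $words): string
{
    $path = tempnam(sys_get_temp_dir(), 'user-words-');
    if ($path === false) {
        throw new RuntimeException('Could not create a temporary file.');
    }
    file_put_contents($path, implode("\n", $words) . "\n");
    return $path;
}

$userWordsFile = writeUserWords(['foo', 'bar']);
echo file_get_contents($userWordsFile); // the same two lines as the example file

// The path could then be handed to the wrapper:
// echo (new TesseractOCR('text.png'))->userWords($userWordsFile)->run();
```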

->userPatterns('/path/to/user-patterns.txt')

Specify the location of the user patterns file.

If the content you are dealing with follows known patterns, this option can noticeably improve tesseract's recognition accuracy.

Example of a user patterns file:

$ cat /path/to/user-patterns.txt
1-\d\d\d-GOOG-441
www.\n\\\*.com

->lang('lang1', 'lang2', 'lang3')

Define one or more languages to be used during the recognition. A complete list of available languages can be found here.

Tip from @daijiale: Use the combination ->lang('chi_sim', 'chi_tra') for proper recognition of Chinese.

->psm(6)

Specify the Page Segmentation Mode, which instructs tesseract how to interpret the given image.

Possible psm values are:

 0 = Orientation and script detection (OSD) only.
 1 = Automatic page segmentation with OSD.
 2 = Automatic page segmentation, but no OSD, or OCR.
 3 = Fully automatic page segmentation, but no OSD. (Default)
 4 = Assume a single column of text of variable sizes.
 5 = Assume a single uniform block of vertically aligned text.
 6 = Assume a single uniform block of text.
 7 = Treat the image as a single text line.
 8 = Treat the image as a single word.
 9 = Treat the image as a single word in a circle.
10 = Treat the image as a single character.
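When the mode comes from user input or configuration, it can help to validate it before passing it to ->psm(). The sketch below is an assumption-laden illustration: `PSM_MODES` and `describePsm()` are hypothetical names built from the list above, and the wrapper itself may not validate the value at all.

```php
<?php
// Hypothetical lookup table, copied from the list of psm values above.
const PSM_MODES = [
    0  => 'Orientation and script detection (OSD) only',
    1  => 'Automatic page segmentation with OSD',
    2  => 'Automatic page segmentation, but no OSD, or OCR',
    3  => 'Fully automatic page segmentation, but no OSD (default)',
    4  => 'Assume a single column of text of variable sizes',
    5  => 'Assume a single uniform block of vertically aligned text',
    6  => 'Assume a single uniform block of text',
    7  => 'Treat the image as a single text line',
    8  => 'Treat the image as a single word',
    9  => 'Treat the image as a single word in a circle',
    10 => 'Treat the image as a single character',
];

// Returns the description of a mode, or throws if the mode is not listed.
function describePsm(int $mode): string
{
    if (!array_key_exists($mode, PSM_MODES)) {
        throw new InvalidArgumentException("Unknown page segmentation mode: $mode");
    }
    return PSM_MODES[$mode];
}

echo describePsm(8), PHP_EOL; // Treat the image as a single word
```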

->config('configvar', 'value')

Tesseract gives the user a great deal of control through its 660 configuration variables.

You can see the complete list by running the following command:

$ tesseract --print-parameters
Tesseract parameters:
... long list with all parameters ...
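On the command line, Tesseract accepts configuration variables as `-c name=value` pairs. The sketch below turns a PHP array of settings into such arguments. `configToArguments()` is a hypothetical helper, and whether the wrapper emits exactly this form is an assumption.

```php
<?php
// Hypothetical helper: renders each setting as a shell-quoted
// "-c name=value" argument for the tesseract command line.
function configToArguments(array $settings): string
{
    $args = [];
    foreach ($settings as $name => $value) {
        $args[] = '-c ' . escapeshellarg($name . '=' . $value);
    }
    return implode(' ', $args);
}

echo configToArguments(['tessedit_char_whitelist' => '0123456789']), PHP_EOL;
```

Quoting each pair as a whole keeps values that contain spaces or shell metacharacters intact.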

->whitelist(range('a', 'z'), range(0, 9), '-_@')

This is a shortcut for ->config('tessedit_char_whitelist', 'abcdef....').
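The arguments may mix arrays (such as the result of `range()`) and plain strings. As a rough illustration of how they collapse into one character list, here is a sketch using a hypothetical `flattenWhitelist()` function; the wrapper's own code may differ.

```php
<?php
// Hypothetical helper: joins any mix of arrays and strings into the single
// string that a character whitelist needs.
function flattenWhitelist(...$groups): string
{
    $chars = '';
    foreach ($groups as $group) {
        // Arrays are joined element by element; anything else is cast to string.
        $chars .= is_array($group) ? implode('', $group) : (string) $group;
    }
    return $chars;
}

echo flattenWhitelist(range('a', 'z'), range(0, 9), '-_@'), PHP_EOL;
// prints: abcdefghijklmnopqrstuvwxyz0123456789-_@
```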

Where to get help

  • #tesseract-ocr-for-php on freenode IRC

License

Apache License 2.0.