Tesseract OCR for PHP

A wrapper to work with Tesseract OCR inside PHP.


Installation

Via Composer:

$ composer require thiagoalessio/tesseract_ocr

‼️ This library depends on Tesseract OCR, version 3.03 or later.
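
A minimal sketch for checking this requirement from PHP before running any OCR. The helper names parseTesseractVersion and meetsMinimumVersion are made up for this example and are not part of the library; the sketch assumes the version banner starts with the word "tesseract" followed by a dotted version number.

```php
<?php
// Extract a version such as "3.05.01" from the text printed by `tesseract --version`.
function parseTesseractVersion(string $output): ?string
{
    if (preg_match('/tesseract\s+v?(\d+(?:\.\d+)+)/i', $output, $m)) {
        return $m[1];
    }
    return null; // banner not recognized, e.g. the binary is missing
}

// True when the detected version is at least the required minimum.
function meetsMinimumVersion(?string $version, string $minimum = '3.03'): bool
{
    return $version !== null && version_compare($version, $minimum, '>=');
}

// Usage (assumes `tesseract` is on the PATH; 2>&1 because some builds
// print the version banner to stderr instead of stdout):
// $version = parseTesseractVersion((string) shell_exec('tesseract --version 2>&1'));
// var_dump(meetsMinimumVersion($version));
```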


Note for Windows users

There are many ways to install Tesseract OCR on your system, but if you just want something quick to get up and running, I recommend installing the Capture2Text package with Chocolatey.

choco install capture2text --version 3.9

⚠️ Recent versions of Capture2Text stopped shipping the tesseract binary.


Note for macOS users

With MacPorts you can install support for individual languages, like so:

$ sudo port install tesseract-<langcode>

That is not possible with Homebrew, which ships only English support by default. If you intend to use other languages, the quickest solution is to install them all:

$ brew install tesseract --with-all-languages

Usage

Basic usage

use thiagoalessio\TesseractOCR\TesseractOCR;
echo (new TesseractOCR('text.png'))
    ->run();

Output:

The quick brown fox
jumps over
the lazy dog.

Other languages

use thiagoalessio\TesseractOCR\TesseractOCR;
echo (new TesseractOCR('german.png'))
    ->lang('deu')
    ->run();

Output:

Bülowstraße

Multiple languages

use thiagoalessio\TesseractOCR\TesseractOCR;
echo (new TesseractOCR('mixed-languages.png'))
    ->lang('eng', 'jpn', 'spa')
    ->run();

Output:

I eat すし y Pollo

Inducing recognition

use thiagoalessio\TesseractOCR\TesseractOCR;
echo (new TesseractOCR('8055.png'))
    ->whitelist(range('A', 'Z'))
    ->run();

Output:

BOSS

Breaking CAPTCHAs

Yes, I know some of you might want to use this library for the noble purpose of breaking CAPTCHAs, so please take a look at this comment:

thiagoalessio#91 (comment)

API

executable

Define a custom location for the tesseract executable, if for any reason it is not in the $PATH.

echo (new TesseractOCR('img.png'))
    ->executable('/path/to/tesseract')
    ->run();
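
When it is not known in advance whether tesseract is in the PATH, the location can be resolved first and then passed to ->executable(). The following is only a sketch: findExecutable is a hypothetical helper, not a library function, and the fallback directories are whatever makes sense on your system.

```php
<?php
// Return the first executable file called $name found in the PATH or,
// failing that, in one of the $fallbackDirs. Returns null when nothing matches.
function findExecutable(string $name, array $fallbackDirs = []): ?string
{
    $pathDirs = explode(PATH_SEPARATOR, (string) getenv('PATH'));
    foreach (array_merge($pathDirs, $fallbackDirs) as $dir) {
        if ($dir === '') {
            continue; // empty PATH entry
        }
        $candidate = rtrim($dir, '/\\') . DIRECTORY_SEPARATOR . $name;
        if (is_file($candidate) && is_executable($candidate)) {
            return $candidate;
        }
    }
    return null;
}

// Usage (the fallback directory is an assumption for illustration):
// $bin = findExecutable('tesseract', ['/opt/local/bin']);
// if ($bin !== null) {
//     echo (new TesseractOCR('img.png'))->executable($bin)->run();
// }
```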

tessdataDir

Specify a custom location for the tessdata directory.

echo (new TesseractOCR('img.png'))
    ->tessdataDir('/path')
    ->run();
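
A tessdata directory holds one "<langcode>.traineddata" file per installed language, so its contents show which values ->lang() can accept for that directory. The sketch below lists them; languagesInTessdata is a hypothetical helper written for this example.

```php
<?php
// List the language codes for which a "<code>.traineddata" file exists in $tessdataDir.
function languagesInTessdata(string $tessdataDir): array
{
    $pattern = rtrim($tessdataDir, '/\\') . DIRECTORY_SEPARATOR . '*.traineddata';
    $files = glob($pattern) ?: []; // glob() may return false on error
    $codes = array_map(function ($file) {
        return basename($file, '.traineddata');
    }, $files);
    sort($codes);
    return $codes;
}

// Usage:
// print_r(languagesInTessdata('/path'));
```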

userWords

Specify the location of the user words file.

This is a plain text file containing a list of words that you want tesseract to treat as normal dictionary words.

Useful when dealing with contents that contain technical terminology, jargon, etc.

$ cat /path/to/user-words.txt
foo
bar

echo (new TesseractOCR('img.png'))
    ->userWords('/path/to/user-words.txt')
    ->run();
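
If the word list lives in your application rather than in a file, it can be written out just before recognition. This is a sketch under that assumption; writeUserWordsFile is a hypothetical helper, and it produces the one-word-per-line layout shown above.

```php
<?php
// Write one word per line to $path, skipping blanks and duplicates. Returns $path.
function writeUserWordsFile(array $words, string $path): string
{
    $clean = array_unique(array_filter(array_map('trim', $words), 'strlen'));
    if (file_put_contents($path, implode("\n", $clean) . "\n") === false) {
        throw new RuntimeException("Could not write user words file: $path");
    }
    return $path;
}

// Usage:
// $path = writeUserWordsFile(['foo', 'bar'], tempnam(sys_get_temp_dir(), 'words'));
// echo (new TesseractOCR('img.png'))->userWords($path)->run();
```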

userPatterns

Specify the location of user patterns file.

If the contents you are dealing with follow known patterns, this option can greatly improve tesseract's recognition accuracy.

$ cat /path/to/user-patterns.txt
1-\d\d\d-GOOG-441
www.\n\\\*.com

echo (new TesseractOCR('img.png'))
    ->userPatterns('/path/to/user-patterns.txt')
    ->run();

lang

Define one or more languages to be used during the recognition. A complete list of available languages can be found at: https://github.com/tesseract-ocr/tesseract/blob/master/doc/tesseract.1.asc#languages

Tip from @daijiale: Use the combination ->lang('chi_sim', 'chi_tra') for proper recognition of Chinese.

echo (new TesseractOCR('img.png'))
    ->lang('lang1', 'lang2', 'lang3')
    ->run();

psm

Specify the Page Segmentation Method, which instructs tesseract how to interpret the given image.

More info: https://github.com/tesseract-ocr/tesseract/wiki/ImproveQuality#page-segmentation-method

echo (new TesseractOCR('img.png'))
    ->psm(6)
    ->run();
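
Bare numbers such as 6 are hard to read in calling code. One option, sketched here, is a small constants class of your own; Psm is a hypothetical name and not part of the library. The values and descriptions follow tesseract's help text for a few commonly used modes, and the full list depends on your tesseract version.

```php
<?php
// A few page segmentation modes, named after the descriptions in tesseract's help text.
final class Psm
{
    const AUTO = 3;         // fully automatic page segmentation, no OSD (tesseract's default)
    const SINGLE_BLOCK = 6; // assume a single uniform block of text
    const SINGLE_LINE = 7;  // treat the image as a single text line
    const SINGLE_WORD = 8;  // treat the image as a single word
    const SINGLE_CHAR = 10; // treat the image as a single character
}

// Usage:
// echo (new TesseractOCR('img.png'))->psm(Psm::SINGLE_LINE)->run();
```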

whitelist

This is a shortcut for ->config('tessedit_char_whitelist', 'abcdef....').

echo (new TesseractOCR('img.png'))
    ->whitelist(range('a', 'z'), range(0, 9), '-_@')
    ->run();
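
To see which string ends up as the tessedit_char_whitelist value, the arguments can be flattened by hand. The function below is an illustration of the idea, not the library's actual code; flattenWhitelist is a hypothetical name.

```php
<?php
// Flatten a mix of arrays and strings into one string of allowed characters.
function flattenWhitelist(...$parts): string
{
    $chars = '';
    foreach ($parts as $part) {
        // (array) turns a plain string into a one-element array,
        // so arrays and strings are handled the same way.
        $chars .= implode('', (array) $part);
    }
    return $chars;
}

// Usage:
// echo flattenWhitelist(range('a', 'z'), range(0, 9), '-_@');
```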

format

Specify an output format other than text. Available options are HOCR and TSV (TSV requires Tesseract 3.05 or later).

echo (new TesseractOCR('img.png'))
    ->format('hocr')
    ->run();

hocr

Shortcut for ->format('hocr').

echo (new TesseractOCR('img.png'))
    ->hocr()
    ->run();
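
hOCR output is HTML in which each recognized word is an element with the class ocrx_word and a title attribute carrying its bounding box. A sketch of pulling the words back out, assuming PHP's DOM extension is available; wordsFromHocr is a hypothetical helper, not a library method.

```php
<?php
// Extract each recognized word and its bounding box from an hOCR string.
function wordsFromHocr(string $hocr): array
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // keep parser warnings quiet
    $doc->loadHTML('<?xml encoding="UTF-8">' . $hocr); // prefix makes libxml read UTF-8
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $words = [];
    $xpath = new DOMXPath($doc);
    $query = "//*[contains(concat(' ', normalize-space(@class), ' '), ' ocrx_word ')]";
    foreach ($xpath->query($query) as $node) {
        // The title looks like "bbox 36 92 96 116; x_wconf 90".
        if (preg_match('/bbox (\d+) (\d+) (\d+) (\d+)/', $node->getAttribute('title'), $m)) {
            $words[] = [
                'text' => trim($node->textContent),
                'bbox' => array_map('intval', array_slice($m, 1)), // left, top, right, bottom
            ];
        }
    }
    return $words;
}

// Usage:
// $words = wordsFromHocr((new TesseractOCR('img.png'))->hocr()->run());
```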

tsv

Shortcut for ->format('tsv').

echo (new TesseractOCR('img.png'))
    ->tsv()
    ->run();
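
The TSV output is one header line followed by tab-separated rows, with columns such as level, left, top, width, height, conf and text. The sketch below turns it into PHP arrays; parseTsv and wordsFromTsv are hypothetical helpers, and the word filter assumes that level 5 marks word rows, as in tesseract's TSV renderer.

```php
<?php
// Turn tesseract's TSV output into a list of rows keyed by the header names.
function parseTsv(string $tsv): array
{
    $lines = preg_split('/\R/', rtrim($tsv, "\r\n"));
    $header = explode("\t", array_shift($lines));
    $rows = [];
    foreach ($lines as $line) {
        $fields = explode("\t", $line);
        if (count($fields) !== count($header)) {
            continue; // skip blank or malformed lines
        }
        $rows[] = array_combine($header, $fields);
    }
    return $rows;
}

// Keep only word-level rows (level 5) that actually contain text.
function wordsFromTsv(array $rows): array
{
    return array_values(array_filter($rows, function (array $row) {
        return $row['level'] === '5' && trim($row['text']) !== '';
    }));
}

// Usage:
// $words = wordsFromTsv(parseTsv((new TesseractOCR('img.png'))->tsv()->run()));
```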

Other options

Any configuration option offered by Tesseract can be set like this:

echo (new TesseractOCR('img.png'))
    ->config('config_var', 'value')
    ->config('other_config_var', 'other value')
    ->run();

Or like this:

echo (new TesseractOCR('img.png'))
    ->configVar('value')
    ->otherConfigVar('other value')
    ->run();

More info: https://github.com/tesseract-ocr/tesseract/wiki/ControlParams
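
The two call styles in this section are equivalent because a camelCase method name maps to the snake_case name of the configuration variable. The class below only illustrates that mapping; OptionRecorder is a hypothetical stand-in that records options, not the library's implementation.

```php
<?php
// Hypothetical stand-in that records options the way both call styles would set them.
final class OptionRecorder
{
    public $options = [];

    public function config(string $name, string $value): self
    {
        $this->options[$name] = $value;
        return $this;
    }

    // Any unknown method, e.g. ->otherConfigVar('x'), becomes config('other_config_var', 'x').
    public function __call(string $method, array $args): self
    {
        // Put an underscore before every capital letter except a leading one, then lowercase.
        $name = strtolower(preg_replace('/(?<!^)[A-Z]/', '_$0', $method));
        return $this->config($name, (string) ($args[0] ?? ''));
    }
}

// Usage:
// $r = (new OptionRecorder())->config('config_var', 'value')->otherConfigVar('other value');
// print_r($r->options);
```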

Where to get help

Join the chat at https://gitter.im/thiagoalessio/tesseract-ocr-for-php

How to contribute

See CONTRIBUTING.md.

License

tesseract-ocr-for-php is released under the MIT License.

Made with love in Berlin