tessy
tessy is a Python wrapper for Google's Tesseract-OCR, an optical character recognition engine used to detect and extract text data from various image file formats.
Features
- No initial dependencies besides Tesseract.
- Supports input images in PNG, JPG, JPEG, GIF, TIF and BMP format.
- Supports multiple input images via text file (.txt).
- Supports image objects from Pillow, wxPython, PyQt4/PyQt5/PySide and OpenCV.
- Dynamically detects and imports the corresponding image module at runtime.
- Supports txt, box, pdf, hocr, tsv and osd as output file format.
- Supports multiple output formats.
- Can convert any raw output data to string, bytes or dict (except pdf).
- Works on macOS, Linux and Windows.
- Well documented.
Installation
Prerequisites
- Python 3.4+
- Google's Tesseract-OCR 3.5.x+ (5.0.x+ on Windows recommended)
Installing Tesseract
Tesseract comes in two parts: The engine itself and the training data for each supported language.
>Installation on macOS (via Homebrew)
- Install both Tesseract and the training data:
brew install tesseract
>Installation on Linux (Debian/Ubuntu)
The package is generally called tesseract or tesseract-ocr.
- Install Tesseract:
sudo apt-get install tesseract-ocr
- Install an additional language:
sudo apt-get install tesseract-ocr-<langcode>
- (example) Install the Finnish language:
sudo apt-get install tesseract-ocr-fin
- You can also install all languages at once by running:
sudo apt-get install tesseract-ocr-all
It is strongly recommended to browse the Tesseract-OCR wiki for more information about installation on other Linux distributions and about installing languages.
>Installation on Windows
Both 32bit and 64bit installers for Windows are available from Tesseract at UB Mannheim (version 5.0.x+ recommended).
>Post-install it is strongly recommended to:
- Make sure the tesseract command is invokable.
  - This is generally not the case on Windows, where you have to add the tesseract installation directory to your PATH.
- Set the TESSDATA_PREFIX environment variable to point to your tessdata directory (<tesseract_dir>\tessdata on Windows, the location varies on macOS/Linux).
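Both post-install checks can be verified from Python with the standard library alone. The following is a minimal sketch and not part of tessy; `check_tesseract_setup` is a hypothetical helper name chosen for illustration.

```python
import os
import shutil


def check_tesseract_setup(command="tesseract"):
    """Report whether the command is on PATH and where TESSDATA_PREFIX points."""
    return {
        # Absolute path of the executable, or None if it is not on PATH
        "command_path": shutil.which(command),
        # Value of TESSDATA_PREFIX, or None if the variable is not set
        "tessdata_prefix": os.environ.get("TESSDATA_PREFIX"),
    }


print(check_tesseract_setup())
```

A `None` value for either key indicates that the corresponding post-install step still needs attention.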
Installing tessy
>Install the PyPI package:
sudo pip install tessy
>or clone the repository:
git clone https://github.com/k4rian/tessy
Basic Usage
import tessy
# Inits the module (optional - see the examples)
tessy.init()
# Path to an image
image = "/home/tess/images/image-1.png"
# Gets the text from the image
text = tessy.image_to_string(image)
# Shows the text in the console
print(text)
Check out the examples for more advanced usage and the documentation to see which features are available.
API
Table of Contents
- tessy.Lang
- tessy.command
- tessy.set_command
- tessy.data_dir
- tessy.set_data_dir
- tessy.content_sep
- tessy.set_content_sep
- tessy.configure
- tessy.init
- tessy.image_to_file
- tessy.image_to_data
- tessy.image_to_string
- tessy.locate
- tessy.locate_data
- tessy.run
- tessy.runnable
- tessy.start (alias)
- tessy.tesseract_version
- tessy.clear_cache
- tessy.clear_temp
- tessy.sv_to_dict
- tessy.boxes_to_dict
- tessy.hocr_to_dict
- tessy.osd_to_dict
tessy.Lang
TessyLang enumerates all languages supported by Tesseract.
This class can only be accessed through the tessy.Lang variable.
TessyLang.all()
Returns a key-value pair list of all available languages.
TessyLang.contains(value)
Returns True if the given value equals any existing language value.
TessyLang.join(*args)
Returns a string in which the given languages are joined by the + separator.
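As an illustration of the joined format, here is a minimal sketch that works on plain language codes. It is not tessy's implementation, and `join_langs` is a hypothetical name.

```python
def join_langs(*langs):
    """Join language codes with the '+' separator."""
    return "+".join(str(lang) for lang in langs)


print(join_langs("eng", "fra", "deu"))  # eng+fra+deu
```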
TessyLang.print_all()
Prints the key-value pairs of all available languages in the console.
TessyLang.AFRIKAANS = "afr"
TessyLang.AMHARIC = "amh"
TessyLang.ARABIC = "ara"
TessyLang.ASSAMESE = "asm"
TessyLang.AZERBAIJANI = "aze"
TessyLang.AZERBAIJANI_CYRILLIC = "aze_cyrl"
TessyLang.BELARUSIAN = "bel"
TessyLang.BENGALI = "ben"
TessyLang.TIBETAN = "bod"
TessyLang.BOSNIAN = "bos"
TessyLang.BULGARIAN = "bul"
TessyLang.CATALAN_VALENCIAN = "cat"
TessyLang.CEBUANO = "ceb"
TessyLang.CZECH = "ces"
TessyLang.CHINESE_SIMPLIFIED = "chi_sim"
TessyLang.CHINESE_TRADITIONAL = "chi_tra"
TessyLang.CHEROKEE = "chr"
TessyLang.WELSH = "cym"
TessyLang.DANISH = "dan"
TessyLang.GERMAN = "deu"
TessyLang.DZONGKHA = "dzo"
TessyLang.GREEK_MODERN = "ell"
TessyLang.ENGLISH = "eng"
TessyLang.ENGLISH_MIDDLE = "enm"
TessyLang.ESPERANTO = "epo"
TessyLang.ESTONIAN = "est"
TessyLang.BASQUE = "eus"
TessyLang.PERSIAN = "fas"
TessyLang.FINNISH = "fin"
TessyLang.FRENCH = "fra"
TessyLang.FRANKISH = "frk"
TessyLang.FRENCH_MIDDLE = "frm"
TessyLang.IRISH = "gle"
TessyLang.GALICIAN = "glg"
TessyLang.GREEK_ANCIENT = "grc"
TessyLang.GUJARATI = "guj"
TessyLang.HAITIAN = "hat"
TessyLang.HAITIAN_CREOLE = "hat"
TessyLang.HEBREW = "heb"
TessyLang.HINDI = "hin"
TessyLang.CROATIAN = "hrv"
TessyLang.HUNGARIAN = "hun"
TessyLang.INUKTITUT = "iku"
TessyLang.INDONESIAN = "ind"
TessyLang.ICELANDIC = "isl"
TessyLang.ITALIAN = "ita"
TessyLang.ITALIAN_OLD = "ita_old"
TessyLang.JAVANESE = "jav"
TessyLang.JAPANESE = "jpn"
TessyLang.KANNADA = "kan"
TessyLang.GEORGIAN = "kat"
TessyLang.GEORGIAN_OLD = "kat_old"
TessyLang.KAZAKH = "kaz"
TessyLang.CENTRAL_KHMER = "khm"
TessyLang.KIRGHIZ = "kir"
TessyLang.KYRGYZ = "kir"
TessyLang.KOREAN = "kor"
TessyLang.KURDISH = "kur"
TessyLang.LAO = "lao"
TessyLang.LATIN = "lat"
TessyLang.LATVIAN = "lav"
TessyLang.LITHUANIAN = "lit"
TessyLang.MALAYALAM = "mal"
TessyLang.MARATHI = "mar"
TessyLang.MACEDONIAN = "mkd"
TessyLang.MALTESE = "mlt"
TessyLang.MALAY = "msa"
TessyLang.BURMESE = "mya"
TessyLang.NEPALI = "nep"
TessyLang.DUTCH = "nld"
TessyLang.FLEMISH = "nld"
TessyLang.NORWEGIAN = "nor"
TessyLang.ORIYA = "ori"
TessyLang.PANJABI = "pan"
TessyLang.PUNJABI = "pan"
TessyLang.POLISH = "pol"
TessyLang.PORTUGUESE = "por"
TessyLang.PUSHTO = "pus"
TessyLang.PASHTO = "pus"
TessyLang.ROMANIAN = "ron"
TessyLang.MOLDAVIAN = "ron"
TessyLang.MOLDOVAN = "ron"
TessyLang.RUSSIAN = "rus"
TessyLang.SANSKRIT = "san"
TessyLang.SINHALA = "sin"
TessyLang.SINHALESE = "sin"
TessyLang.SLOVAK = "slk"
TessyLang.SLOVENIAN = "slv"
TessyLang.SPANISH = "spa"
TessyLang.CASTILIAN = "spa"
TessyLang.SPANISH_OLD = "spa_old"
TessyLang.CASTILIAN_OLD = "spa_old"
TessyLang.ALBANIAN = "sqi"
TessyLang.SERBIAN = "srp"
TessyLang.SERBIAN_LATIN = "srp_latn"
TessyLang.SWAHILI = "swa"
TessyLang.SWEDISH = "swe"
TessyLang.SYRIAC = "syr"
TessyLang.TAMIL = "tam"
TessyLang.TELUGU = "tel"
TessyLang.TAJIK = "tgk"
TessyLang.TAGALOG = "tgl"
TessyLang.THAI = "tha"
TessyLang.TIGRINYA = "tir"
TessyLang.TURKISH = "tur"
TessyLang.UIGHUR = "uig"
TessyLang.UYGHUR = "uig"
TessyLang.UKRAINIAN = "ukr"
TessyLang.URDU = "urd"
TessyLang.UZBEK = "uzb"
TessyLang.UZBEK_CYRILLIC = "uzb_cyrl"
TessyLang.VIETNAMESE = "vie"
TessyLang.YIDDISH = "yid"
tessy.command()
Returns the Tesseract command.
The returned value can be either the command itself or the binary path.
Default: tesseract
tessy.set_command(cmd, check_runnable=False, write_cache=False)
Sets the Tesseract command.
If check_runnable is set to True, the function will check if the given command is runnable by starting a new process.
If write_cache is set to True, the given command will be stored in a special file located in the temporary directory to help tessy locate Tesseract when the init function is called.
tessy.data_dir()
Returns the Tesseract data directory.
Default: None
tessy.set_data_dir(datadir, update_env=True)
Sets the location of the Tesseract data directory.
datadir must be an absolute path.
If update_env is set to True, the TESSDATA_PREFIX environment variable will be set to the value of datadir.
tessy.content_sep()
Returns the content separator.
The content separator is used as the delimiter when multiple strings are joined in the functions image_to_string and image_to_data.
Default: ||||
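To show what a delimiter-joined result looks like and how it can be split back into its parts, here is a minimal sketch. It only assumes the default separator documented above; `join_contents` and `split_contents` are hypothetical names, not tessy functions.

```python
CONTENT_SEP = "||||"  # default separator documented above


def join_contents(contents, sep=CONTENT_SEP):
    """Join several text results into one string."""
    return sep.join(contents)


def split_contents(joined, sep=CONTENT_SEP):
    """Recover the individual results from a joined string."""
    return joined.split(sep)


joined = join_contents(["text from txt", "text from tsv"])
print(split_contents(joined))  # ['text from txt', 'text from tsv']
```

Splitting is only reliable when the separator cannot occur in the recognized text, which is a reason to keep it unusual.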
tessy.set_content_sep(sep)
Sets the content separator.
tessy.configure(**kw)
All-in-one function to set the command, the data directory and/or the content separator.
Supported keywords: command, data_dir, content_sep
Example usage:
configure(command="tesseract", content_sep="~")
Example usage with packed parameters as list/tuple:
configure(command=("tesseract", True), data_dir=("/home/tess/data", True))
See set_command, set_data_dir and set_content_sep functions documentation for more details about the parameters.
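One way to picture the packed form is that the first element is the main value and the remaining elements are forwarded as the extra arguments of the matching setter. The sketch below illustrates that reading only; `unpack_setting` is a hypothetical helper and tessy's own handling may differ.

```python
def unpack_setting(value):
    """Split a configure()-style value into (main_value, extra_args)."""
    if isinstance(value, (list, tuple)):
        # Packed form: first item is the value, the rest are extra arguments
        return value[0], tuple(value[1:])
    # Plain form: a single value with no extra arguments
    return value, ()


print(unpack_setting(("tesseract", True)))  # ('tesseract', (True,))
print(unpack_setting("~"))                  # ('~', ())
```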
tessy.init()
Initializes the module by performing some verifications.
- Checks if the Tesseract command is valid by trying to start a new process using the runnable function. If the command fails, it will try to locate the Tesseract binary by calling the locate function.
- Checks if the Tesseract data directory has been set by calling the locate_data function.
tessy.image_to_file(image, output_filename_base=None, output_format='txt', lang=None, config=None)
Extracts any text from the given image and returns a list containing a unique file name for each specified format.
image can be either:
- a string containing the absolute path to an image or to a text file containing multiple absolute image paths.
- a Pillow Image.
- a wxPython Image.
- a PyQt4/PyQt5/PySide QImage.
- an OpenCV NumPy ndarray.
output_filename_base may contain the file name without extension used as reference for any output file generated by Tesseract.
e.g.: If /home/tess/myimage is given, the /home/tess directory may contain the files myimage.txt, myimage.tsv, etc.
If output_filename_base is set to None, all files will use the same random name as reference and will be saved in the OS's temporary directory.
output_format contains one or more output formats that will be processed by Tesseract.
Supported formats: txt, box, hocr, tsv, osd
output_format can be either:
- a string containing one or more formats delimited by a comma,
- a list/tuple of strings.
e.g.: "txt", "txt, tsv, box", ("pdf", "hocr"), ["txt", "box"]
Note: If the osd format is present, Tesseract will only process this format and thus return a single file, even if multiple formats are provided in output_format.
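The accepted shapes of output_format, together with the osd rule from the note, can be sketched as a small normalization step. This is an illustration under those stated rules, not tessy's code; `normalize_formats` is a hypothetical name.

```python
def normalize_formats(output_format):
    """Return a list of format names from a comma-delimited string or a list/tuple."""
    if isinstance(output_format, str):
        parts = output_format.split(",")
    else:
        parts = list(output_format)
    # Trim whitespace and drop empty entries such as a trailing comma
    formats = [part.strip().lower() for part in parts if part.strip()]
    # Per the note above, osd is processed alone when it is present
    if "osd" in formats:
        return ["osd"]
    return formats


print(normalize_formats("txt, tsv, box"))  # ['txt', 'tsv', 'box']
print(normalize_formats(("pdf", "hocr")))  # ['pdf', 'hocr']
print(normalize_formats(["txt", "osd"]))   # ['osd']
```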
lang may contain one or more supported languages.
lang can be either:
- a string containing one or more languages delimited by a plus sign (+),
- a list/tuple of strings,
- a TessyLang enum,
- a list/tuple of TessyLang enums.
e.g.: "deu", "eng+fra", ["eng", "fra", "deu"], tessy.Lang.CZECH, [tessy.Lang.DANISH, tessy.Lang.TURKISH]
If lang is set to None, Tesseract will process the image using the English language value ("eng") as default.
Check the TessyLang class documentation to get the list of all supported languages.
config may contain extra parameters added to the Tesseract command.
config must be a string and each parameter must be delimited by a space.
e.g.: "--oem 0 --psm 6"
tessy.image_to_data(image, output_format='txt', data_output='str', lang=None, config=None)
Extracts any text from the given image and returns a list containing the converted data for each specified format.
data_output specifies the type of output data to be returned for each given format in output_format.
Supported values: DataOutput.BYTES, DataOutput.STRING, DataOutput.DICT
Default: DataOutput.STRING
See image_to_file documentation for more details about the image, output_format, lang and config parameters.
Note: The pdf output format isn't supported by this function.
tessy.image_to_string(image, output_format='txt', lang=None, config=None)
Extracts any text from the given image and returns the data as a string for each specified format.
See image_to_file documentation for more details about the image, output_format, lang and config parameters.
Note: The pdf output format isn't supported by this function.
tessy.locate()
Tries to locate the Tesseract binary and returns its path if found.
- Checks if the cache file (.TESSPATH) is present inside the temporary directory. If the file is found, its content is read and returned as a string.
- (Windows only) Tries to read the Tesseract installation directory in the registry. The registry entry only exists if Tesseract has been previously installed using the Windows precompiled setups. If the Tesseract executable is located, its path is returned as a string.
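The cache lookup in the first step can be sketched with the standard library. The file name comes from the description above; the file layout (a single line holding the path) and the helper names `read_cached_path` and `write_cached_path` are assumptions made for illustration.

```python
import os
import tempfile

CACHE_NAME = ".TESSPATH"  # cache file name documented above


def write_cached_path(path, cache_dir=None):
    """Store the Tesseract path in the cache file and return the file's location."""
    cache_file = os.path.join(cache_dir or tempfile.gettempdir(), CACHE_NAME)
    with open(cache_file, "w", encoding="utf-8") as handle:
        handle.write(path)
    return cache_file


def read_cached_path(cache_dir=None):
    """Return the cached path, or None if the cache file is missing or empty."""
    cache_file = os.path.join(cache_dir or tempfile.gettempdir(), CACHE_NAME)
    try:
        with open(cache_file, "r", encoding="utf-8") as handle:
            content = handle.read().strip()
    except OSError:
        return None
    return content or None
```

Passing an explicit `cache_dir` keeps experiments away from the real temporary directory.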
tessy.locate_data()
Tries to locate the Tesseract data directory and returns its path if found.
- Checks if the TESSDATA_PREFIX environment variable has been set and returns its value.
- Tries to read the Tesseract installation directory in the registry. The registry entry only exists if Tesseract has been previously installed using any Windows precompiled setups. If the Tesseract executable is located, its path is returned as a string.
tessy.run(parameters=None, silent=False)
Runs Tesseract with the given parameters and returns the output as a tuple.
parameters contains the parameters to pass to Tesseract as a string.
e.g.: "C:\my-image.png C:\my-image -l eng+deu box"
If silent is set to True, warning messages won't be logged.
Returned values (3): status (returncode/int), output (stdoutdata/string), err_string (stderrdata/string)
tessy.runnable()
Returns True if the Tesseract process can be started.
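The behaviour described for run and runnable can be approximated with `subprocess`. This is a minimal sketch, not tessy's implementation: `run_command` and `is_runnable` are hypothetical names, and the arguments are taken as a list rather than a single string to sidestep quoting issues. The demo runs the Python interpreter instead of Tesseract so that it works on any machine.

```python
import subprocess
import sys


def run_command(command, args=()):
    """Run a command and return (status, output, err_string)."""
    completed = subprocess.run(
        [command, *args],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,  # decode stdout/stderr to str
    )
    return completed.returncode, completed.stdout, completed.stderr


def is_runnable(command):
    """Return True if the command starts and exits successfully with --version."""
    try:
        status, _, _ = run_command(command, ["--version"])
    except OSError:
        # Raised when the executable cannot be found or started
        return False
    return status == 0


status, output, err_string = run_command(sys.executable, ["-c", "print('ok')"])
print(status, output.strip())
```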
tessy.start(parameters=None, silent=False)
Alias of run
tessy.tesseract_version()
Returns the LooseVersion representation of Tesseract's version.
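For code that only needs to compare version numbers, a plain tuple can be extracted from the version text with a regular expression. The sketch below is independent of tessy; `parse_version` is a hypothetical name and the sample strings are illustrative inputs.

```python
import re


def parse_version(version_output):
    """Extract (major, minor, patch) from text such as 'tesseract 4.1.1'."""
    match = re.search(r"(\d+)\.(\d+)(?:\.(\d+))?", version_output)
    if match is None:
        return None
    # A missing patch component is treated as 0
    return tuple(int(part) if part is not None else 0 for part in match.groups())


print(parse_version("tesseract 4.1.1"))                  # (4, 1, 1)
print(parse_version("tesseract v5.0.0-alpha.20200328"))  # (5, 0, 0)
```

Tuples compare element by element, so `parse_version(text) >= (3, 5, 0)` is a simple way to test for a minimum version.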
tessy.clear_cache()
Tries to remove the .TESSPATH cache file and returns True if it was successfully deleted.
tessy.clear_temp(remove_all=True)
Removes temporary files created by Tesseract (excluding the cache file).
If remove_all is set to True, all temporary files will be deleted.
Otherwise, only temporary files created during execution will be deleted.
tessy.sv_to_dict(sv_data, cell_delimiter='\t')
Converts the given separated-values raw data and returns it as a dictionary.
sv_data must contain a header row.
SV data sample (separated by a comma):
head1,head2,head3,head4
A1,A2,A3,A4
B1,B2,B3,B4
C1,C2,C3,C4
Output:
{
"head1": [
"A1", "B1", "C1"
],
"head2": [
"A2", "B2", "C2"
],
[...]
}
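The row-to-column conversion shown above can be reproduced with the standard `csv` module. This is a sketch of the technique, not tessy's implementation; `sv_to_columns` is a hypothetical name.

```python
import csv
import io


def sv_to_columns(sv_data, cell_delimiter="\t"):
    """Turn separated-values text with a header row into a dict of columns."""
    reader = csv.reader(
        io.StringIO(sv_data),
        delimiter=cell_delimiter,
        quoting=csv.QUOTE_NONE,  # treat quote characters as ordinary text
    )
    rows = [row for row in reader if row]  # skip blank lines
    if not rows:
        return {}
    header, body = rows[0], rows[1:]
    columns = {name: [] for name in header}
    for row in body:
        for name, cell in zip(header, row):
            columns[name].append(cell)
    return columns


sample = "head1,head2\nA1,A2\nB1,B2\n"
print(sv_to_columns(sample, cell_delimiter=","))
# {'head1': ['A1', 'B1'], 'head2': ['A2', 'B2']}
```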
tessy.boxes_to_dict(boxes_data)
Converts the given boxes data and returns it as a dictionary.
boxes_data mustn't contain a header row.
Boxes data sample:
w 165 480 209 525 0
h 174 478 241 532 0
a 217 512 249 552 0
Output:
{
"char": [
"w", "h", "a"
],
"left": [
165, 174, 217
],
"bottom": [
480, 478, 512
],
"right": [
209, 241, 249
],
"top": [
525, 532, 552
],
"page": [
0, 0, 0
]
}
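A conversion producing the same shape as the output above can be sketched in a few lines. It is not tessy's code; `boxes_to_columns` is a hypothetical name, and the key order follows the documented output.

```python
BOX_KEYS = ("char", "left", "bottom", "right", "top", "page")


def boxes_to_columns(boxes_data):
    """Turn box lines ('char left bottom right top page') into a dict of columns."""
    columns = {key: [] for key in BOX_KEYS}
    for line in boxes_data.splitlines():
        # Split from the right so the character stays intact, even if it is a space
        parts = line.rsplit(" ", 5)
        if len(parts) != 6:
            continue  # skip blank or malformed lines
        columns["char"].append(parts[0])
        for key, value in zip(BOX_KEYS[1:], parts[1:]):
            columns[key].append(int(value))
    return columns


sample = "w 165 480 209 525 0\nh 174 478 241 532 0\na 217 512 249 552 0"
print(boxes_to_columns(sample)["left"])  # [165, 174, 217]
```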
tessy.hocr_to_dict(hocr_data)
Converts the given hocr XML data and returns it as a dictionary.
hocr_to_dict requires the xmltodict module to be installed.
The module is only imported if it has already been installed.
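The optional-dependency pattern described here can be sketched as follows. It is an illustration rather than tessy's code: `hocr_to_mapping` is a hypothetical name, and returning None when xmltodict is missing is an assumption about how the fallback could behave.

```python
try:
    import xmltodict  # optional third-party dependency
except ImportError:
    xmltodict = None


def hocr_to_mapping(hocr_data):
    """Parse hOCR (XML) text into nested dictionaries, or return None without xmltodict."""
    if xmltodict is None:
        return None
    return xmltodict.parse(hocr_data)


print(hocr_to_mapping("<html><body><p>hi</p></body></html>"))
```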
tessy.osd_to_dict(osd_data)
Converts the given osd data and returns it as a dictionary.
OSD data sample:
Page number: 0
Orientation in degrees: 270
Rotate: 90
Orientation confidence: 71.00
Script: Latin
Script confidence: nan
Output:
{
"page_number": 0,
"orientation_in_degrees": 270,
"rotate": 90,
"orientation_confidence": 71.00,
"script": "Latin",
"script_confidence": None
}
Note: Dictionary keys are dynamically generated based on the data content.
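The dynamic key generation mentioned in the note can be sketched like this: each label is lowercased and its spaces become underscores, while values are converted to int, float or None where possible. `osd_to_mapping` and `convert_value` are hypothetical names, and the conversion rules are inferred from the sample above rather than taken from tessy's source.

```python
def convert_value(text):
    """Convert a raw OSD value to None, int, float or str, in that order of preference."""
    if text.lower() == "nan":
        return None
    for cast in (int, float):
        try:
            return cast(text)
        except ValueError:
            pass
    return text


def osd_to_mapping(osd_data):
    """Turn 'Label: value' lines into a dict with snake_case keys."""
    result = {}
    for line in osd_data.splitlines():
        label, sep, value = line.partition(":")
        if not sep:
            continue  # skip lines without a 'Label: value' structure
        key = label.strip().lower().replace(" ", "_")
        result[key] = convert_value(value.strip())
    return result


sample = "Page number: 0\nRotate: 90\nScript: Latin\nScript confidence: nan"
print(osd_to_mapping(sample))
# {'page_number': 0, 'rotate': 90, 'script': 'Latin', 'script_confidence': None}
```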