Introduction:

PyTesser is an Optical Character Recognition module for Python. It takes as input an image or image file and outputs a string.

PyTesser uses the Tesseract OCR engine (an open-source project at Google). It converts images to a format Tesseract accepts and calls the Tesseract executable as an external program. A Windows executable is provided along with the Python scripts. The scripts should work on Linux as well.
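
The external-call approach described above can be sketched roughly as follows. This is an illustration only, not PyTesser's actual code: the function names `build_tesseract_command` and `ocr_image_file` are hypothetical, the image is assumed to already be in a format Tesseract accepts, and the sketch relies only on the standard `tesseract <image> <outputbase>` command line, which writes its result to `<outputbase>.txt`.

```python
import os
import subprocess
import tempfile


def build_tesseract_command(image_path, output_base, executable="tesseract"):
    """Return the argument list for `tesseract <image> <outputbase>`."""
    return [executable, image_path, output_base]


def ocr_image_file(image_path, executable="tesseract"):
    """Run Tesseract on an image file and return the recognized text.

    Hypothetical helper: assumes the executable is on the PATH (or that a
    full path is passed in) and that the image format is already accepted.
    """
    with tempfile.TemporaryDirectory() as work_dir:
        output_base = os.path.join(work_dir, "ocr_output")
        command = build_tesseract_command(image_path, output_base, executable)
        # Raises subprocess.CalledProcessError if Tesseract exits with an error.
        subprocess.check_call(command)
        # Tesseract appends ".txt" to the output base name.
        with open(output_base + ".txt", encoding="utf-8") as handle:
            return handle.read()
```

Keeping the command construction in its own function makes it easy to inspect or log the exact call before anything is executed.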

PyTesser:

https://github.com/wannamit/pytesser/

http://code.google.com/p/pytesser/

Tesseract:

http://code.google.com/p/tesseract-ocr/

Dependencies:

PIL (the Python Imaging Library, or its maintained fork Pillow) is required to work with images in memory.

https://pypi.python.org/pypi/Pillow/

This version of PyTesser has been tested with Python 3.4 on Windows 10 and Python 2.7 on Ubuntu 14.04 LTS.

For Linux:
- Add executable permission to all .exe files: chmod +x dlltest.exe tesseract.exe
- Install Wine: sudo apt-get install wine
- Run: winetricks mfc42
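
The Linux steps above suggest that the bundled Windows executable is launched through Wine. A minimal sketch of how a script might choose the launch command per platform is shown below. The function name `tesseract_launch_prefix` is hypothetical, and whether PyTesser selects the command this way is an assumption.

```python
import platform


def tesseract_launch_prefix(system_name=None):
    """Return the leading arguments used to start the bundled tesseract.exe.

    Assumption: on Windows the executable runs directly, while on other
    systems it is started through Wine (`wine tesseract.exe ...`).
    """
    if system_name is None:
        system_name = platform.system()
    if system_name == "Windows":
        return ["tesseract.exe"]
    return ["wine", "tesseract.exe"]
```

The image path and output base name would then be appended to the returned list before the command is run.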

Installation:

No installation. Just download - import - use.

Usage:

>>> from pytesser import *
>>> img = Image.open("my_image_path")
>>> text = image_to_string(img)
>>> print(text)

>>> try:
... 	text = image_file_to_string("my_tif_image", graceful_errors=False)
... except errors.Tesser_General_Exception as value:
... 	print("`tif_image` is incompatible filetype.  Try graceful_errors=True")
... 	print(value)
...

`tif_image` is incompatible filetype.  Try graceful_errors=True
Tesseract Open Source OCR Engine
read_tif_image:Error:Illegal image format:Compression
Tessedit:Error:Read of file failed:fnord.tif
Signal_exit 31 ABORT. LocCode: 3 AbortCode: 3

File Dependencies:

pytesser.py      Main module for importing
util.py          Utility functions used by pytesser.py
errors.py        Interprets exceptions thrown by Tesseract
tesseract.exe    Executable called by pytesser.py
tessdata/        Resources used by tesseract.exe
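
Since there is no installer, a quick check that all of the files listed above sit together in one directory can save some debugging. The sketch below is not part of PyTesser; `REQUIRED_FILES` and `missing_dependencies` are hypothetical names, and the list simply mirrors the table above.

```python
import os

# Names taken from the file list above; "tessdata" is a directory.
REQUIRED_FILES = ["pytesser.py", "util.py", "errors.py", "tesseract.exe", "tessdata"]


def missing_dependencies(directory):
    """Return the required names that do not exist inside `directory`."""
    return [
        name
        for name in REQUIRED_FILES
        if not os.path.exists(os.path.join(directory, name))
    ]
```

Calling `missing_dependencies(".")` from the download directory should return an empty list when everything is in place.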

pdf2jpg.py

This file is a PDF-to-JPG converter written by Ned Batchelder; the original Python 2 code can be found here

Usage:

python3 pdf2jpg.py <path_of_pdf_file>
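
For readers curious about the technique, Ned Batchelder's original script extracts JPEG images that are embedded in a PDF by scanning the raw bytes for JPEG start and end markers. The sketch below illustrates that idea in simplified form. It assumes pdf2jpg.py follows the same approach, and the function name `extract_jpegs` is hypothetical rather than taken from the script.

```python
def extract_jpegs(pdf_bytes):
    """Return the byte strings found between JPEG start and end markers."""
    start_marker = b"\xff\xd8"  # JPEG start-of-image marker
    end_marker = b"\xff\xd9"    # JPEG end-of-image marker
    images = []
    position = 0
    while True:
        start = pdf_bytes.find(start_marker, position)
        if start < 0:
            break
        end = pdf_bytes.find(end_marker, start)
        if end < 0:
            break
        end += len(end_marker)  # include the end marker in the slice
        images.append(pdf_bytes[start:end])
        position = end
    return images
```

This naive scan only finds images stored as plain JPEG streams, and it can cut an image short if the end marker bytes happen to appear inside the image data (for example in an embedded thumbnail).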

pdf2txt.py