- This console tool OCRs TIFF files, extracts a value from each document, and saves the document as a searchable PDF named after the extracted value.
- This PoC was developed to meet the requirements of a customer in the UK&I.
- The PoC tool lets the customer take three TIFF document types (for example Invoice, Proof of Delivery, and Bill of Materials), detect a value in each document type, and save each document as a searchable PDF with the extracted value as the document name.
- The tool can also detect the OCR Dict / Zone value, which is then defined in the IFDT.conf file for OCR operations.
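The layout of IFDT.conf is not shown here, so the following is only a sketch of how such a file could be read with configparser. The section names, the `zone` and `pattern` keys, and the `load_zones` helper are all hypothetical; check the example entry shipped with the tool for the real schema.

```python
import configparser

# Hypothetical IFDT.conf content; the section and key names are assumptions,
# not the tool's actual schema.
EXAMPLE_CONF = """
[Invoice]
zone = 100,50,400,80
pattern = INV-\\d+

[ProofOfDelivery]
zone = 120,60,380,70
pattern = POD-\\d+
"""


def load_zones(conf_text):
    """Return {doctype: {"zone": (x, y, w, h), "pattern": str}}."""
    # interpolation=None keeps '%' characters in patterns literal.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(conf_text)
    zones = {}
    for doctype in parser.sections():
        x, y, w, h = (int(v) for v in parser[doctype]["zone"].split(","))
        zones[doctype] = {
            "zone": (x, y, w, h),
            "pattern": parser[doctype]["pattern"],
        }
    return zones


if __name__ == "__main__":
    for doctype, settings in load_zones(EXAMPLE_CONF).items():
        print(doctype, settings)
```

Keeping one section per document type means a new document type can be added by editing the conf file only.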
- Python - version 3.10.5 - https://docs.python.org/3/
- Modules - https://docs.python.org/3/tutorial/modules.html
- pytesseract - https://pypi.org/project/pytesseract/
- os - https://docs.python.org/3/library/os.html#module-os
- os.path - https://docs.python.org/3/library/os.path.html
- datetime - https://docs.python.org/3/library/datetime.html
- configparser - https://docs.python.org/3/library/configparser.html
- cv2 - https://pypi.org/project/opencv-python/
- Tesseract Open Source OCR Engine
The tool provides the following capabilities:
- OCR of TIFF documents stored in separate document type folders.
- Extraction of a value from each OCR'd document, which is then used as the name of the saved document.
- Output of OCR'd documents as searchable PDFs.
- Logging of all functions to a txt file.
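The capabilities above can be sketched roughly as follows. This is not the tool's own code: the function names, the regex-based extraction, and the file-name clean-up are assumptions made for illustration. Only `pytesseract.image_to_string`, `pytesseract.image_to_pdf_or_hocr`, `cv2.imread`, and `cv2.cvtColor` are real library calls.

```python
import os
import re


def extract_value(ocr_text, pattern):
    """Return the first regex match in the OCR text, or None."""
    match = re.search(pattern, ocr_text)
    return match.group(0) if match else None


def safe_pdf_name(value):
    """Turn an extracted value into a file name that is safe on Windows."""
    cleaned = re.sub(r'[<>:"/\\|?*\s]+', "_", value).strip("_")
    return f"{cleaned or 'unknown'}.pdf"


def tiff_to_searchable_pdf(tiff_path, output_dir, pattern):
    """OCR one TIFF and save it as a searchable PDF named after the value."""
    # Imported here so the helpers above work without the OCR dependencies.
    import cv2
    import pytesseract

    # cv2.imread reads only the first page of a multi-page TIFF.
    image = cv2.imread(tiff_path)
    if image is None:
        raise FileNotFoundError(f"Could not read image: {tiff_path}")
    # OpenCV loads images as BGR; pytesseract assumes RGB for NumPy arrays.
    rgb = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)

    text = pytesseract.image_to_string(rgb)
    value = extract_value(text, pattern)

    # Returns the PDF (page image plus invisible text layer) as bytes.
    pdf_bytes = pytesseract.image_to_pdf_or_hocr(rgb, extension="pdf")
    out_path = os.path.join(output_dir, safe_pdf_name(value or "unknown"))
    with open(out_path, "wb") as handle:
        handle.write(pdf_bytes)
    return out_path
```

A call such as `tiff_to_searchable_pdf("Input/Invoice/scan.tif", "Output", r"INV-\d+")` would need Tesseract installed and uses placeholder paths and a placeholder pattern.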
Set up the project from source files:
- Download and install Python 3.10.5 from https://www.python.org/downloads/
- Ensure Python is added to the system PATH environment variable.
- Install the dependencies from the provided requirements.txt file:
  - pip3 install -r requirements.txt
- Download and install the Tesseract OCR Engine for Windows from https://tesseract-ocr.github.io/tessdoc/Downloads.html
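Once the setup steps are done, a quick check like the one below can confirm that the prerequisites are visible to Python. It is a hypothetical helper, not part of the tool; the module list and the `tesseract` binary name are assumptions, and the version check accepts any Python 3.10 or later rather than exactly 3.10.5.

```python
import importlib.util
import shutil
import sys


def missing_prerequisites(modules=("pytesseract", "cv2"), binary="tesseract"):
    """Return a list of readable problems; an empty list means all checks passed."""
    problems = []
    if sys.version_info < (3, 10):
        problems.append(f"Python 3.10+ expected, found {sys.version.split()[0]}")
    for name in modules:
        # find_spec returns None when a top-level module cannot be located.
        if importlib.util.find_spec(name) is None:
            problems.append(f"Python module not installed: {name}")
    # shutil.which searches the PATH the same way the shell would.
    if shutil.which(binary) is None:
        problems.append(f"'{binary}' not found on PATH")
    return problems


if __name__ == "__main__":
    for problem in missing_prerequisites():
        print("MISSING:", problem)
```

If Tesseract is installed but not on PATH, pytesseract can instead be pointed at the executable through `pytesseract.pytesseract.tesseract_cmd`.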
Running the project from source files:
- Ensure all dependencies are installed.
- Populate the conf file as per the example entry provided.
- Ensure the input documents are in the correct Input/doctype directory.
- Run 'python ifdt.py'
- Check the Output directory for the output and the logs for errors.
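Before running the tool, it can help to confirm which TIFF files sit under each document type folder. The sketch below assumes the Input/doctype layout mentioned above; `find_tiffs` is a hypothetical helper and is not part of ifdt.py.

```python
import os
import tempfile


def find_tiffs(input_root):
    """Map each document-type folder under input_root to its sorted TIFF paths."""
    found = {}
    for doctype in sorted(os.listdir(input_root)):
        folder = os.path.join(input_root, doctype)
        if not os.path.isdir(folder):
            continue  # ignore stray files next to the doctype folders
        found[doctype] = sorted(
            os.path.join(folder, name)
            for name in os.listdir(folder)
            if name.lower().endswith((".tif", ".tiff"))
        )
    return found


if __name__ == "__main__":
    # Demo on a throwaway directory so the sketch runs anywhere.
    with tempfile.TemporaryDirectory() as root:
        os.makedirs(os.path.join(root, "Invoice"))
        open(os.path.join(root, "Invoice", "scan1.tif"), "w").close()
        for doctype, paths in find_tiffs(root).items():
            print(doctype, len(paths), "TIFF file(s)")
```

An empty list for a document type usually means the files were placed in the wrong folder or do not have a .tif or .tiff extension.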
Created by James Dunne - James.Dunne1@gmail.com