OCR_FOR_PDFS

Optical Character Recognition for Scanned Documents

Primary language: Python

The program extracts text from a scanned document supplied as a PDF, regardless of the document's length.
It uses Tesseract OCR for recognition and OpenCV to preprocess the page images, which are generated from the PDF by the pdf2image module.
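
The flow described above could look roughly like the following. This is a minimal sketch, not the repository's actual code: it assumes the `pytesseract` wrapper is used to call Tesseract, that `opencv-python`, `numpy` and `pdf2image` are installed (along with the Tesseract and Poppler binaries they rely on), and the function names `pdf_to_text` and `join_pages` as well as the 300 DPI default are illustrative choices.

```python
from typing import Iterable


def join_pages(texts: Iterable[str], separator: str = "\n\n") -> str:
    """Join per-page OCR output into one string, trimming stray whitespace."""
    return separator.join(text.strip() for text in texts)


def pdf_to_text(pdf_path: str, dpi: int = 300) -> str:
    """Render each PDF page to an image, convert it to grayscale, and OCR it.

    Hypothetical helper: the name and the dpi default are assumptions.
    """
    # Third-party imports are kept local so that join_pages stays usable
    # even when the OCR dependencies are not installed.
    import cv2
    import numpy as np
    import pytesseract
    from pdf2image import convert_from_path

    # pdf2image returns one PIL image per page of the PDF.
    pages = convert_from_path(pdf_path, dpi=dpi)

    texts = []
    for page in pages:
        # PIL image -> RGB numpy array -> single-channel grayscale for OpenCV.
        rgb = np.array(page.convert("RGB"))
        gray = cv2.cvtColor(rgb, cv2.COLOR_RGB2GRAY)
        texts.append(pytesseract.image_to_string(gray))

    return join_pages(texts)
```

Calling `pdf_to_text("scan.pdf")` (with a hypothetical file name) would return the recognized text of all pages as one string. Note that this sketch renders every page into memory at once, so a very long document may need to be processed in page ranges instead.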

The accuracy of the OCR can be improved by:

  • Preprocessing the image with OpenCV before recognition.
  • Running a spell check on the extracted text, which can also improve how the text reads.
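
Both ideas can be sketched in a few lines. The snippet below is an illustration under stated assumptions, not the project's implementation: `preprocess_for_ocr` assumes a median blur followed by Otsu thresholding is a suitable cleanup for the scans in question, and `correct_words` is a deliberately simple stand-in for a real spell checker, built on the standard library's `difflib` and a caller-supplied vocabulary. Both function names and the `cutoff` default are hypothetical.

```python
import difflib
import re
from typing import Iterable


def preprocess_for_ocr(gray):
    """Denoise and binarize an 8-bit grayscale page image (numpy array)."""
    # Imported locally so the spell-check helper works without OpenCV.
    import cv2

    # A small median blur removes salt-and-pepper noise from the scan.
    blurred = cv2.medianBlur(gray, 3)
    # Otsu's method picks the black/white threshold automatically;
    # cv2.threshold returns (chosen_threshold, binarized_image).
    _, binary = cv2.threshold(
        blurred, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU
    )
    return binary


def correct_words(text: str, vocabulary: Iterable[str], cutoff: float = 0.8) -> str:
    """Replace words not in the vocabulary with their closest known word.

    Words with no sufficiently similar vocabulary entry are left unchanged.
    Replacements are emitted in lowercase, which is a simplification.
    """
    known = {word.lower() for word in vocabulary}
    candidates = sorted(known)

    def fix(match: "re.Match[str]") -> str:
        word = match.group(0)
        if word.lower() in known:
            return word
        close = difflib.get_close_matches(
            word.lower(), candidates, n=1, cutoff=cutoff
        )
        return close[0] if close else word

    # Only alphabetic runs are checked; digits and punctuation pass through.
    return re.sub(r"[A-Za-z]+", fix, text)
```

For example, `correct_words("the documnet was scaned", ["the", "document", "was", "scanned"])` maps the two misrecognized words back to vocabulary entries. A dedicated spell-checking library with a full dictionary would likely do better on real OCR output; the point here is only to show where such a step fits after text extraction.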