Tesseract PDF Image Text Extractor

This is a simple tool that extracts text from scanned PDF pages using the Tesseract OCR engine.

For now, this document only covers running the tool on Windows.

The main dependencies of this project are:

canvas - Visit GitHub of canvas

Note: the canvas npm package has some compatibility issues on x86-based Windows platforms.
node-tesseract-ocr - Visit GitHub of node-tesseract-ocr
Tesseract OCR engine - Visit GitHub of Tesseract OCR engine

Tesseract at UB Mannheim

UB Mannheim (the Mannheim University Library) provides ready-to-install Tesseract binaries for Windows (both x86 and x64).

Details are available on their GitHub wiki.

Install the Node dependencies by running:

npm install

You also need to install the Tesseract binaries from here.

Create the following folders in the project directory:

processedText, processedImages, fileSource
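If you would rather create the folders from a script than by hand, a minimal sketch could look like the one below. The helper name ensureFolders is hypothetical and not part of this project; only the three folder names come from the instructions above.

```javascript
const fs = require("fs");
const path = require("path");

// Folder names taken from the setup instructions.
const REQUIRED_FOLDERS = ["processedText", "processedImages", "fileSource"];

// Hypothetical helper: create each required folder under baseDir.
// With { recursive: true }, folders that already exist are left untouched.
function ensureFolders(baseDir) {
  return REQUIRED_FOLDERS.map((name) => {
    const dir = path.join(baseDir, name);
    fs.mkdirSync(dir, { recursive: true });
    return dir;
  });
}
```

Calling ensureFolders(process.cwd()) from the project directory should leave all three folders in place, and running it a second time does no harm.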

Then run the following command:

npm run-script extract

This command looks for a file named test.pdf in the fileSource folder and starts extracting.

To process a different file, pass its name with --fileName, replacing FILENAME with the name of your file:

npm --fileName="FILENAME" run-script extract
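For readers curious how a script might pick up the --fileName value, here is a rough sketch. It assumes npm passes the flag to the script as an npm_config_* environment variable. The exact casing of that variable name can differ between npm versions, so the lookup is case-insensitive. The helper resolveSourcePath is hypothetical and may not match how this project actually reads the flag.

```javascript
const path = require("path");

// Default file name used when no --fileName flag is given.
const DEFAULT_FILE = "test.pdf";

// Hypothetical helper: find the --fileName value among npm_config_* variables.
// The key comparison is case-insensitive because the exact variable name
// is an assumption and may vary between npm versions.
function resolveSourcePath(env, sourceDir = "fileSource") {
  const key = Object.keys(env).find(
    (k) => k.toLowerCase() === "npm_config_filename"
  );
  const fileName = key && env[key] ? env[key] : DEFAULT_FILE;
  // path.basename drops any directory part, so the file is always
  // looked up directly inside sourceDir.
  return path.join(sourceDir, path.basename(fileName));
}
```

In a real script this would be called as resolveSourcePath(process.env). Passing the environment in as an argument keeps the lookup easy to check in isolation.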

The extracted text is saved to the processedText folder, with a separate text file for each page.
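As an illustration of the one-file-per-page layout, the sketch below writes an array of page texts to separate files. The naming scheme (name-page-N.txt) and the helpers pageTextPath and savePages are assumptions made for this example; the tool's actual file names may differ.

```javascript
const fs = require("fs");
const path = require("path");

// Hypothetical naming scheme: <pdf name>-page-<n>.txt, pages numbered from 1.
function pageTextPath(outDir, pdfName, pageNumber) {
  const base = path.basename(pdfName, path.extname(pdfName));
  return path.join(outDir, `${base}-page-${pageNumber}.txt`);
}

// Write one UTF-8 text file per page and return the paths that were written.
function savePages(outDir, pdfName, pageTexts) {
  fs.mkdirSync(outDir, { recursive: true });
  return pageTexts.map((text, index) => {
    const target = pageTextPath(outDir, pdfName, index + 1);
    fs.writeFileSync(target, text, "utf8");
    return target;
  });
}
```

For example, savePages("processedText", "test.pdf", texts) with two page strings would write test-page-1.txt and test-page-2.txt under this assumed scheme.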

Important: files to be processed must be placed inside the fileSource folder.

Happy extraction...!