Optical character recognition, or optical character reader (OCR), is the electronic or mechanical conversion of images of typed, handwritten, or printed text into machine-encoded text. The source can be a scanned document, a photo of a document, a scene photo, or subtitle text superimposed on an image.
Steps:
- Convert the PDF to images; for a multi-page PDF, save each page as a separate image file.
- Convert the colour image to a grayscale image.
- Read or create the target bounding boxes.
- Use Tesseract to recognize the characters in the image.
- Build an r-tree index from the bounding boxes in the Tesseract output data.
- Query the r-tree index for the boxes that intersect each target bounding box.
- Use the matching indices to get the required rows from the data frame, and continue processing the text if necessary.
- Repeat the steps above for the remaining pages.
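
The matching step (finding which recognized words fall inside a target bounding box) can be sketched in plain Python. This is a minimal illustration, not the project's actual code: the `Box` type, the helper names, and the sample rows are all hypothetical. The rows only mimic the `left`, `top`, `width`, `height`, and `text` columns that Tesseract's word-level output provides. A simple linear scan stands in for the r-tree lookup here so the example runs without extra packages; an r-tree returns the same set of intersecting boxes, only faster on large pages.

```python
from typing import NamedTuple


class Box(NamedTuple):
    """Axis-aligned rectangle in pixel coordinates (hypothetical helper type)."""
    left: int
    top: int
    right: int
    bottom: int


def word_box(row: dict) -> Box:
    """Convert a left/top/width/height row into corner coordinates."""
    return Box(row["left"], row["top"],
               row["left"] + row["width"], row["top"] + row["height"])


def intersects(a: Box, b: Box) -> bool:
    """Return True when the two rectangles overlap or touch."""
    return (a.left <= b.right and b.left <= a.right
            and a.top <= b.bottom and b.top <= a.bottom)


def text_in_target(rows: list, target: Box) -> str:
    """Join the text of every non-empty word whose box intersects the target.

    Linear scan used in place of an r-tree query (assumption for this sketch).
    """
    hits = [r for r in rows
            if r["text"].strip() and intersects(word_box(r), target)]
    # Rough reading order: top to bottom, then left to right.
    hits.sort(key=lambda r: (r["top"], r["left"]))
    return " ".join(r["text"] for r in hits)


# Made-up sample rows, shaped like word-level OCR output.
SAMPLE_ROWS = [
    {"left": 50, "top": 40, "width": 80, "height": 20, "text": "Invoice"},
    {"left": 140, "top": 40, "width": 50, "height": 20, "text": "No:"},
    {"left": 210, "top": 40, "width": 70, "height": 20, "text": "12345"},
    {"left": 50, "top": 300, "width": 90, "height": 20, "text": "Total"},
]

if __name__ == "__main__":
    # Target box drawn around the invoice number only.
    print(text_in_target(SAMPLE_ROWS, Box(200, 30, 300, 70)))
```

With real data, the rows would come from the OCR data frame for one page, and the target boxes would be the ones read or created in the earlier step.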
Reference: