PDF parsing - detecting text boundary

Question

Closed a year ago · 1 comment

Description:

The goal of this ticket is to develop a way to quickly identify text boundaries on any given PDF page. The solution should detect text regions automatically and let a human step in to correct any errors in the detected boundaries. It should also include a survey of existing open-source tools that already do this (if any).
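One possible way to model "automatic detection plus human correction" is to keep the detector output untouched and apply human edits on top of it. The sketch below is only an illustration: `TextBox`, `Correction`, and `apply_corrections` are hypothetical names, not part of any existing library or of this project.

```python
from dataclasses import dataclass, replace
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class TextBox:
    """A detected text region in page pixel coordinates (hypothetical model)."""
    box_id: int
    left: int
    top: int
    width: int
    height: int
    source: str = "auto"  # "auto" = detector output, "human" = edited or added by hand


@dataclass(frozen=True)
class Correction:
    """A human edit: new_box=None deletes the box, otherwise it replaces or adds it."""
    box_id: int
    new_box: Optional[TextBox]


def apply_corrections(detected: Iterable[TextBox],
                      corrections: Iterable[Correction]) -> List[TextBox]:
    """Return the detected boxes with human corrections applied, in box_id order."""
    boxes = {b.box_id: b for b in detected}
    for c in corrections:
        if c.new_box is None:
            boxes.pop(c.box_id, None)  # drop a false positive
        else:
            # Replace an existing box or add a missed one, and mark it as human-made.
            boxes[c.box_id] = replace(c.new_box, box_id=c.box_id, source="human")
    return sorted(boxes.values(), key=lambda b: b.box_id)
```

Keeping the corrections as a separate list means the original detection is never overwritten, so the human edits can be reviewed or replayed later.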

For now, it only needs to detect free text. Images, charts, tables, etc. can be ignored.

This applies only to PDFs from which libraries like PyPDF2 cannot usefully extract text.
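To decide whether a page falls into this category, one option is a rough check on whatever text the extractor returned (for example, the string from PyPDF2's `page.extract_text()`). The function name `needs_ocr` and both thresholds below are assumptions chosen for illustration; they would need tuning on real documents.

```python
def needs_ocr(extracted_text: str,
              min_chars: int = 20,
              min_word_ratio: float = 0.5) -> bool:
    """Guess whether extracted text is too sparse or too noisy to be useful.

    Thresholds are illustrative assumptions, not measured values.
    """
    text = extracted_text.strip()
    if len(text) < min_chars:
        return True  # almost nothing extracted, e.g. a scanned page

    tokens = text.split()
    # Count tokens in which at least half of the characters are letters.
    wordlike = sum(
        1 for t in tokens
        if sum(ch.isalpha() for ch in t) * 2 >= len(t)
    )
    return wordlike / len(tokens) < min_word_ratio
```

Pages for which this returns `True` would go through the boundary detection route; the rest can keep using the normal text extraction.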

Answer 1 · 2023-06-25T11:41:56.000Z

Current progress on this:

- Line- and word-level segmentation
- Identification of columns
- Methods to identify and order paragraphs
  - Tesseract OCR, which provides bounding boxes over the desired regions of text
  - Looking into Detectron2 for deep-learning-based boundary detection
    - https://github.com/Layout-Parser/layout-parser/blob/main/installation.md
    - https://www.analyticsvidhya.com/blog/2021/05/document-layout-detection-and-ocr-with-detectron2/
- Working on a small pop-up window to display the detected bounding boxes and let a human edit them
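The segmentation, column, and ordering steps above could be sketched roughly as follows. This is not the code used in this ticket. It assumes the word boxes arrive as a dict shaped like the output of `pytesseract.image_to_data(image, output_type=pytesseract.Output.DICT)`, and the function names and the `min_gap` value are made up for illustration.

```python
from typing import Dict, List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


def line_boxes(data: Dict[str, list]) -> List[Box]:
    """Merge word boxes that share (block, paragraph, line) ids into one box per line."""
    lines: Dict[Tuple[int, int, int], Box] = {}
    for i, text in enumerate(data["text"]):
        if not str(text).strip():
            continue  # skip empty entries (Tesseract emits them for non-word levels)
        key = (data["block_num"][i], data["par_num"][i], data["line_num"][i])
        l, t = data["left"][i], data["top"][i]
        r, b = l + data["width"][i], t + data["height"][i]
        if key in lines:
            pl, pt, pr, pb = lines[key]
            l, t, r, b = min(l, pl), min(t, pt), max(r, pr), max(b, pb)
        lines[key] = (l, t, r, b)
    return list(lines.values())


def split_columns(boxes: List[Box], min_gap: int = 30) -> List[List[Box]]:
    """Group boxes into columns, splitting where a horizontal gap exceeds min_gap."""
    columns: List[List[Box]] = []
    right_edge = None
    for box in sorted(boxes):  # sorted by left edge first
        if right_edge is None or box[0] > right_edge + min_gap:
            columns.append([])  # this box starts a new column
            right_edge = box[2]
        columns[-1].append(box)
        right_edge = max(right_edge, box[2])
    return columns


def reading_order(boxes: List[Box], min_gap: int = 30) -> List[Box]:
    """Order boxes column by column (left to right), top to bottom within a column."""
    ordered: List[Box] = []
    for column in split_columns(boxes, min_gap):
        ordered.extend(sorted(column, key=lambda bx: (bx[1], bx[0])))
    return ordered
```

One known limitation of this simple gap rule: a full-width heading that spans several columns would merge them into one, so such lines would need to be handled separately or fixed in the manual editing step.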