PDF parsing - detecting text boundary
Issue closed · 1 comment
Gautam-Rajeev commented
Description:
The purpose of this ticket is to develop a solution for quickly identifying text boundaries on any given PDF page. The solution should detect text regions automatically and allow human intervention to correct errors in the detected boundaries. The work should include a survey of existing open-source tools that already do this (if any).
For now, it only needs to detect free text; it can ignore images, charts, tables, etc.
This is only for PDFs from which libraries like PyPDF2 cannot usefully extract text.
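As a rough illustration of that last condition, here is a minimal sketch of how one might flag pages whose extracted text is not usable. It assumes the text has already been pulled out by some library (for example PyPDF2); the function name `needs_ocr` and both thresholds are hypothetical and would need tuning on real documents.

```python
def needs_ocr(extracted_text: str,
              min_chars: int = 20,
              min_readable_ratio: float = 0.9) -> bool:
    """Guess whether a page's extracted text is too poor to use.

    Hypothetical heuristic: a page is flagged when it yields almost no
    text, or when too many characters are not letters, digits,
    whitespace, or common punctuation.
    """
    text = extracted_text.strip()
    if len(text) < min_chars:
        return True  # empty or near-empty text layer
    readable = sum(
        ch.isalnum() or ch.isspace() or ch in ".,;:!?'\"()-"
        for ch in text
    )
    return readable / len(text) < min_readable_ratio
```

Pages flagged this way would be the ones routed to the boundary-detection pipeline this ticket asks for.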
shrivastava95 commented
Current progress on this:
- Line- and word-level segmentation
- Identification of columns
- Methods to identify and order paragraphs
- Tesseract OCR to provide bounding boxes over the desired regions of text
- Looking into Detectron2 for deep-learning-based boundary detection
- Working on a small pop-up window to display the obtained bounding-box regions and edit them with human input
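
The segmentation steps in the list above (words to lines, column identification, paragraph grouping and ordering) could be sketched roughly as follows. This is not the code used in this work, just a minimal geometric heuristic under stated assumptions: word boxes are given as pixel rectangles with `left`, `top`, `width`, `height`, and `text` fields (the same fields that, for instance, `pytesseract.image_to_data` reports per word), columns are separated by an empty vertical gutter, and paragraphs are separated by extra vertical space. The names `Box`, `split_columns`, `group_lines`, `group_paragraphs`, `page_paragraphs`, and all thresholds are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Box:
    left: int
    top: int
    width: int
    height: int
    text: str = ""

    @property
    def right(self) -> int:
        return self.left + self.width

    @property
    def bottom(self) -> int:
        return self.top + self.height


def enclose(boxes: List[Box]) -> Box:
    """Smallest box enclosing `boxes`; text is joined in the order given."""
    left = min(b.left for b in boxes)
    top = min(b.top for b in boxes)
    right = max(b.right for b in boxes)
    bottom = max(b.bottom for b in boxes)
    text = " ".join(b.text for b in boxes)
    return Box(left, top, right - left, bottom - top, text)


def split_columns(words: List[Box], min_gap: int = 40) -> List[List[Box]]:
    """Start a new column wherever a horizontal gap of at least
    `min_gap` pixels contains no words (assumed gutter width)."""
    columns: List[List[Box]] = []
    current: List[Box] = []
    reach = 0  # right-most edge seen so far in the current column
    for w in sorted(words, key=lambda b: b.left):
        if current and w.left - reach >= min_gap:
            columns.append(current)
            current = []
        reach = w.right if not current else max(reach, w.right)
        current.append(w)
    if current:
        columns.append(current)
    return columns


def group_lines(words: List[Box]) -> List[Box]:
    """Merge words into lines: a word joins the current line when its
    vertical centre lies above that line's lowest bottom edge."""
    lines: List[List[Box]] = []
    for w in sorted(words, key=lambda b: b.top):
        centre = w.top + w.height / 2
        if lines and centre < max(b.bottom for b in lines[-1]):
            lines[-1].append(w)
        else:
            lines.append([w])
    return [enclose(sorted(line, key=lambda b: b.left)) for line in lines]


def group_paragraphs(lines: List[Box], max_gap_ratio: float = 0.8) -> List[Box]:
    """Merge top-to-bottom lines into paragraphs; a vertical gap larger
    than `max_gap_ratio` times the line height starts a new paragraph."""
    paragraphs: List[List[Box]] = []
    for line in lines:
        gap_ok = (paragraphs and
                  line.top - paragraphs[-1][-1].bottom <= max_gap_ratio * line.height)
        if gap_ok:
            paragraphs[-1].append(line)
        else:
            paragraphs.append([line])
    return [enclose(p) for p in paragraphs]


def page_paragraphs(words: List[Box], column_gap: int = 40) -> List[Box]:
    """Paragraph boxes in reading order: columns left to right,
    paragraphs top to bottom within each column."""
    result: List[Box] = []
    for column in split_columns(words, column_gap):
        result.extend(group_paragraphs(group_lines(column)))
    return result
```

The returned paragraph boxes are plain rectangles, so they could be handed to an editing window for manual correction. A fixed-gap heuristic like this will break on skewed scans or ragged layouts, which is presumably where a learned detector such as Detectron2 would come in.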