Samagra-Development/ai-tools

PDF parsing - detecting text boundary

Closed · 1 comment

Description:

The purpose of this ticket is to develop a solution for quickly identifying text boundaries on any given PDF page. The solution should detect text regions automatically and allow human intervention to correct errors in the detected boundaries. The work should include a survey of existing open-source tools that already do this (if any).

For now, the solution only needs to detect free text. It can ignore images, charts, tables, etc.
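
To make the requirement concrete, here is a minimal sketch of one classical approach, projection-profile segmentation, together with a hook for manual correction. It is not the ticket's chosen method, only an illustration. It assumes the page has already been rendered and binarized into a grid where 1 means ink and 0 means background. The names `Box`, `find_runs`, `detect_text_regions`, and `apply_correction` are hypothetical. Note that this sketch treats any ink as text, so filtering out images, charts, and tables would need a separate step.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Box:
    top: int             # first row of the region (inclusive)
    bottom: int          # last row of the region (exclusive)
    left: int            # first column (inclusive)
    right: int           # last column (exclusive)
    source: str = "auto"  # "auto" for detected boxes, "manual" for human fixes


def find_runs(profile: List[int], min_gap: int = 1) -> List[Tuple[int, int]]:
    """Return (start, end) ranges where profile > 0.

    Runs separated by fewer than min_gap empty entries are merged.
    """
    runs = []
    start = None
    for i, value in enumerate(profile):
        if value > 0 and start is None:
            start = i
        elif value == 0 and start is not None:
            runs.append((start, i))
            start = None
    if start is not None:
        runs.append((start, len(profile)))

    merged: List[Tuple[int, int]] = []
    for run in runs:
        if merged and run[0] - merged[-1][1] < min_gap:
            merged[-1] = (merged[-1][0], run[1])
        else:
            merged.append(run)
    return merged


def detect_text_regions(page: List[List[int]], min_gap: int = 2) -> List[Box]:
    """Split a binarized page (1 = ink, 0 = background) into horizontal bands."""
    row_profile = [sum(row) for row in page]
    boxes = []
    for top, bottom in find_runs(row_profile, min_gap):
        # Tighten each band to the columns that actually contain ink.
        inked = [
            c for c in range(len(page[0]))
            if any(page[r][c] for r in range(top, bottom))
        ]
        boxes.append(Box(top, bottom, inked[0], inked[-1] + 1))
    return boxes


def apply_correction(boxes: List[Box], index: int, corrected: Box) -> List[Box]:
    """Return a new list with boxes[index] replaced by a human-corrected box."""
    fixed = list(boxes)
    fixed[index] = Box(corrected.top, corrected.bottom,
                       corrected.left, corrected.right, source="manual")
    return fixed
```

Keeping a `source` field on each box is one way to record which boundaries a person has corrected, so that a later automatic pass does not overwrite them.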

This ticket applies only to PDFs from which libraries like PyPDF2 cannot extract useful text.
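
As a rough illustration of how such PDFs might be identified, the sketch below checks whether the text already extracted from each page looks usable. The function name `has_usable_text_layer` and both thresholds are assumptions made for this example, not values from the ticket. The input is a list of per-page strings; with PyPDF2 these could come from calling `extract_text()` on each page of a `PdfReader`.

```python
from typing import List, Optional


def has_usable_text_layer(
    page_texts: List[Optional[str]],
    min_chars_per_page: int = 20,       # assumed threshold, tune per corpus
    min_printable_ratio: float = 0.9,   # assumed threshold, tune per corpus
) -> bool:
    """Guess whether extracted text is good enough to skip boundary detection.

    page_texts holds one string (or None) per page, for example:
        [p.extract_text() for p in PyPDF2.PdfReader(path).pages]
    """
    if not page_texts:
        return False

    usable_pages = 0
    for text in page_texts:
        # Drop all whitespace so blank or whitespace-only pages count as empty.
        stripped = "".join((text or "").split())
        if len(stripped) < min_chars_per_page:
            continue
        # U+FFFD is the replacement character that often marks failed decoding.
        printable = sum(
            1 for ch in stripped if ch.isprintable() and ch != "\ufffd"
        )
        if printable / len(stripped) >= min_printable_ratio:
            usable_pages += 1

    # Treat the document as usable if at least half of its pages are.
    return usable_pages >= len(page_texts) / 2
```

Documents for which this returns `False` would be the candidates for the boundary-detection workflow described in this ticket.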

Current progress on this: