Extracting non-tabular data from PDFs, is it possible?
mikejokic opened this issue · 1 comments
mikejokic commented
This library is absolutely amazing, and multiple orders of magnitude faster than anything else I have tried for reading tables.
I have a particular use case where I am performing NLP on my PDFs. My PDFs contain complex tables (tabula handles these wonderfully) as well as paragraph-based text. So on any given page, I can have a table with text surrounding it.
Using tabula, I can retain some structure from the data in my tables. No issues there. However, if I could also get the remaining parts (1 to n) of a page that don't contain tables as a simple string, I could work with that data as well.
Any ideas/suggestions?
github-actions commented
@mikejokic this issue was automatically closed because it did not follow the issue template
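For anyone landing here with the same question: tabula itself targets tables, so one possible approach is to pair it with a separate text extractor and drop every word that falls inside a detected table region. The sketch below is only an illustration under stated assumptions, not tabula's API. It assumes you already have (a) table bounding boxes as dicts with `left`, `top`, `width`, `height` (the shape tabula's JSON output appears to use) and (b) words as dicts with `text`, `x0`, `x1`, `top`, `bottom` (the shape pdfplumber's `extract_words()` returns), both in the same page coordinate system. The helper names `in_box` and `text_outside_tables` are hypothetical.

```python
from typing import Dict, List

Box = Dict[str, float]  # assumed keys: left, top, width, height


def in_box(word: dict, box: Box) -> bool:
    """Return True if the center of the word lies inside the table box."""
    cx = (word["x0"] + word["x1"]) / 2
    cy = (word["top"] + word["bottom"]) / 2
    return (box["left"] <= cx <= box["left"] + box["width"]
            and box["top"] <= cy <= box["top"] + box["height"])


def text_outside_tables(words: List[dict], table_boxes: List[Box],
                        line_tol: float = 3.0) -> str:
    """Join all words that are not inside any table box into one string."""
    # Keep only the words whose center is outside every table region.
    kept = [w for w in words if not any(in_box(w, b) for b in table_boxes)]
    kept.sort(key=lambda w: (w["top"], w["x0"]))

    # Group words into lines: a word joins the current line when its top
    # edge is within line_tol of the first word of that line.
    lines: List[List[dict]] = []
    for w in kept:
        if lines and abs(w["top"] - lines[-1][0]["top"]) <= line_tol:
            lines[-1].append(w)
        else:
            lines.append([w])

    # Order each line left to right and join lines with newlines.
    return "\n".join(
        " ".join(w["text"] for w in sorted(line, key=lambda w: w["x0"]))
        for line in lines
    )


if __name__ == "__main__":
    # Made-up sample data standing in for one page.
    sample_words = [
        {"text": "Intro", "x0": 50, "x1": 80, "top": 40, "bottom": 50},
        {"text": "paragraph", "x0": 85, "x1": 140, "top": 40, "bottom": 50},
        {"text": "Cell", "x0": 60, "x1": 80, "top": 110, "bottom": 120},
        {"text": "Closing", "x0": 50, "x1": 90, "top": 300, "bottom": 310},
    ]
    sample_tables = [{"left": 40, "top": 100, "width": 300, "height": 150}]
    print(text_outside_tables(sample_words, sample_tables))
```

Filtering on the word's center point rather than its full box is a deliberate simplification: it tolerates table boxes that are slightly too tight, at the cost of occasionally keeping or dropping a word that straddles the table border.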