chezou/tabula-py

Extracting non-tabular data from PDFs: is it possible?

mikejokic opened this issue · 1 comment

This library is absolutely amazing, and multiple orders of magnitude faster than anything else I have tried for reading tables.

I have a particular use case where I am performing NLP on my PDFs. My PDFs contain complex tables (which tabula handles wonderfully) as well as paragraph-based text. So any given page can have a table with text surrounding it.

Using tabula, I can retain some structure from the data in my tables. No issues there. However, if I could also get the rest of a page (the part that doesn't contain tables) as a simple string, I could work with that data as well.

Any ideas/suggestions?

@mikejokic this issue was automatically closed because it did not follow the issue template
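
For anyone landing here with the same question, one possible approach is to treat the table regions as an exclusion mask: collect the bounding box of every table on the page, extract positioned words with a separate text extractor, and keep only the words that fall outside those boxes. The sketch below shows just the filtering step in plain Python. It rests on assumptions that are not confirmed in this thread: that table boxes are available as dicts with `top`, `left`, `width`, and `height` keys (for example from tabula's JSON output), and that some other PDF text library supplies words as `(x0, top, x1, bottom, text)` tuples in the same coordinate system. The names `Word`, `inside`, and `text_outside_tables` are hypothetical helpers, not part of tabula-py.

```python
from typing import Dict, List, Tuple

# Assumed word format: (x0, top, x1, bottom, text), origin at the top-left of the page.
Word = Tuple[float, float, float, float, str]


def inside(word: Word, box: Dict[str, float]) -> bool:
    """Return True if the centre of the word lies within the table box."""
    x0, top, x1, bottom, _ = word
    cx = (x0 + x1) / 2
    cy = (top + bottom) / 2
    return (
        box["left"] <= cx <= box["left"] + box["width"]
        and box["top"] <= cy <= box["top"] + box["height"]
    )


def text_outside_tables(words: List[Word], table_boxes: List[Dict[str, float]]) -> str:
    """Join, in rough reading order, the words that are not inside any table box."""
    kept = [w for w in words if not any(inside(w, b) for b in table_boxes)]
    # Sort top-to-bottom, then left-to-right; rounding groups words on the same line.
    kept.sort(key=lambda w: (round(w[1]), w[0]))
    return " ".join(w[4] for w in kept)


# Made-up sample data: two words above a table, one inside it, one below it.
sample_words: List[Word] = [
    (10, 10, 40, 20, "Intro"),
    (45, 10, 80, 20, "text"),
    (10, 50, 30, 60, "cell"),
    (10, 100, 50, 110, "Outro"),
]
sample_tables = [{"top": 40, "left": 0, "width": 200, "height": 30}]

print(text_outside_tables(sample_words, sample_tables))
```

Testing the word centre rather than requiring full containment keeps words that only slightly overlap a table edge from being misclassified. Whether the two tools actually share a coordinate system would need to be checked on a real page before relying on this.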