Is it possible to extract non-tabular data (everything outside tabula's output) from a PDF?
mikejokic opened this issue · 3 comments
Is your feature request related to a problem? Please describe.
Not a bug/problem. This library is absolutely amazing, and multiple orders of magnitude faster than anything else I have tried for reading tables.
I have a particular use case where I am performing NLP on my PDFs. My PDFs contain complex tables (which tabula handles wonderfully) as well as paragraph-based text. So any given page can have a table with text surrounding it.
Describe the solution you'd like
Using tabula, I can retain some structure from the data in my tables. No issues there. However, if I could also get the remainder of a page (the part that doesn't contain tables) as a simple string, I could work with that data as well.
Describe alternatives you've considered
I have considered using bounding boxes to identify tables and mask them.
Additional context
Any ideas/suggestions?
Thanks for raising an issue. If I understand correctly, you want to extract the non-tabular text from a PDF as a string? I don't quite understand the point. Could you elaborate on it?
Hi, yes I do.
I'm doing some NLP on my documents (which contain paragraphs and tables), and simply reading the PDF in as a single string causes structural issues.
Using tabula, I can import tabular relational data into my database while also retaining the structural information of the tables. I still want to perform NLP on my paragraph-based text. If tabula could provide that as well, it would help me retain the whole document structure (e.g., on page 10 there are 2 tables, identified by tabula, and also this paragraph).
Understood. However, it's out of scope for this project, unfortunately. Please consider PyPDF or another package.
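For anyone landing here later: the bounding-box masking idea mentioned in the issue can be sketched independently of any particular PDF library. The snippet below is only a minimal sketch and is not part of tabula. It assumes you already have word-level boxes from some text extractor and table boxes from a table detector, both in the same page coordinate system (tabula's JSON output reports position fields per table, but check the exact key names against your version). `Box`, `Word`, `center_inside`, and `text_outside_tables` are hypothetical names introduced here for illustration.

```python
from typing import Iterable, List, NamedTuple


class Box(NamedTuple):
    """Axis-aligned rectangle; top < bottom, left < right (assumed convention)."""
    left: float
    top: float
    right: float
    bottom: float


class Word(NamedTuple):
    """A piece of text plus where it sits on the page (hypothetical structure)."""
    text: str
    box: Box


def center_inside(inner: Box, outer: Box) -> bool:
    """Return True if the center point of `inner` lies within `outer`."""
    cx = (inner.left + inner.right) / 2
    cy = (inner.top + inner.bottom) / 2
    return outer.left <= cx <= outer.right and outer.top <= cy <= outer.bottom


def text_outside_tables(words: Iterable[Word], tables: List[Box]) -> str:
    """Join, in input order, the words whose centers fall outside every table box."""
    kept = [
        w.text
        for w in words
        if not any(center_inside(w.box, t) for t in tables)
    ]
    return " ".join(kept)


# Made-up example data: one table region and three words on a page.
tables = [Box(left=40, top=100, right=200, bottom=200)]
words = [
    Word("Intro", Box(10, 10, 40, 20)),     # above the table
    Word("cell", Box(50, 110, 80, 120)),    # inside the table
    Word("Outro", Box(10, 300, 40, 310)),   # below the table
]

print(text_outside_tables(words, tables))  # prints "Intro Outro"
```

Testing the word's center rather than requiring full containment is a design choice: it tolerates words that slightly overhang a detected table edge. Whether that is the right rule depends on how tight the table boxes are in your documents.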