Extracting non-tabular (1-tabula output) data from pdf, is it possible?

Question

mikejokic opened this issue 2 years ago · 3 comments

Is your feature request related to a problem? Please describe.

Not a bug/problem. This library is absolutely amazing, and multiple orders of magnitude faster than anything else I have tried for reading tables.

I have a particular use case where I am performing NLP on my PDFs. My PDFs contain complex tables (which tabula handles wonderfully) as well as paragraph-based text. So on any given page, I can have a table with text surrounding it.

Describe the solution you'd like
Using tabula, I can retain some structure from the data in my tables. No issues there. However, if I could get the remainder of a page (everything that is not inside a table) as a simple string, I could work with that data as well.

Describe alternatives you've considered

I have considered using bounding boxes to identify tables and mask them.
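To make the masking idea concrete, here is a minimal sketch in plain Python. It assumes you already have one bounding box per table and one per word, each as a dict with `top`, `left`, `width`, and `height` in the same page coordinates. I believe tabula's JSON output reports those four keys per table, but treat that as an assumption to check. The word boxes would have to come from a separate text-extraction library. The function names and sample data are hypothetical.

```python
def inside(word: dict, table: dict) -> bool:
    """Return True when the centre of the word box lies inside the table box."""
    cx = word["left"] + word["width"] / 2
    cy = word["top"] + word["height"] / 2
    return (
        table["left"] <= cx <= table["left"] + table["width"]
        and table["top"] <= cy <= table["top"] + table["height"]
    )


def text_outside_tables(words: list, tables: list) -> str:
    """Join the words that fall outside every table box, in reading order."""
    kept = [w for w in words if not any(inside(w, t) for t in tables)]
    # Approximate reading order: top to bottom, then left to right.
    kept.sort(key=lambda w: (round(w["top"]), w["left"]))
    return " ".join(w["text"] for w in kept)


# Hypothetical page: two lines of prose around one table.
words = [
    {"text": "Intro", "top": 50, "left": 72, "width": 30, "height": 10},
    {"text": "paragraph.", "top": 50, "left": 106, "width": 55, "height": 10},
    {"text": "Revenue", "top": 120, "left": 80, "width": 45, "height": 10},
    {"text": "42", "top": 120, "left": 200, "width": 12, "height": 10},
    {"text": "Closing", "top": 300, "left": 72, "width": 40, "height": 10},
    {"text": "note.", "top": 300, "left": 116, "width": 25, "height": 10},
]
tables = [{"top": 100, "left": 72, "width": 400, "height": 150}]

print(text_outside_tables(words, tables))
```

Testing the centre point rather than requiring full containment keeps words that only slightly overlap a table edge from being classified inconsistently. It would still need tuning for multi-column layouts.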

Additional context

Any ideas/suggestions?

Answer 1 · 2023-03-06T02:06:05.000Z

Thanks for raising an issue. If I understand correctly, you want to extract the non-tabular text from a PDF? I don't quite understand the point. Could you elaborate on it?

Answer 2 · 2023-03-07T13:05:29.000Z

Hi, yes I do.

I'm doing some NLP on my documents (containing paragraphs and tables), and simply reading the PDF in as a single string causes structural issues.

Using tabula, I can import tabular relational data into my database while also retaining the structural information of the tables. I still want to perform NLP on my paragraph-based text. If tabula could provide that as well, it would help me retain the whole document structure (e.g., on page 10 there are 2 tables, identified by tabula, and also this paragraph).

Answer 3 · 2023-03-09T03:56:18.000Z

Understood. However, that's out of scope for this project, unfortunately. Please consider PyPDF or some other package.