chezou/tabula-py

Don't ignore empty columns in tables spanning multiple pages

Closed · 1 comment

Is your feature request related to a problem? Please describe.

I have a PDF file with multiple pages. From page 2 to the end (page 29), a single table spans all of the pages. On some pages, some columns are empty because the rows on that page have no values for them. Also, the first line of the table is the table's name, and the second line contains the actual header.

As a result, it seems impossible to read in the table. I cannot map the column names from line 2 of page 2 onto each page's DataFrame, because empty columns are simply dropped.

Describe the solution you'd like

I would like tabula to keep empty columns when a table spans multiple pages.

Describe alternatives you've considered

The only alternatives I see at this point are copying 28 pages of values by hand or trying to parse the PDF myself in Python, and I rate my chances with the latter as very low.

Additional context

Sorry if this request seems unnecessary. I spent more than an hour searching for a solution and was unable to find one.

chezou commented

This looks like a bug report rather than a feature request. It'd be great if you could open a new issue as a bug report with a specific PDF attached. Presumably, this behavior comes from a limitation in tabula-java.
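
In the meantime, one possible workaround is to stop tabula from guessing the columns on each page and pass explicit column boundaries instead, so that every page should yield the same number of columns. The sketch below is untested against the PDF in question. `read_fixed_columns`, `stack_pages`, `boundaries`, and `header` are hypothetical names for illustration, and the boundary values (X coordinates in points) would have to be measured from the actual document. It also assumes that tabula-java emits empty cells for columns that have no text on a page once `columns` is given.

```python
import pandas as pd


def stack_pages(frames, header):
    """Give every per-page frame the same column names, then concatenate."""
    aligned = []
    for frame_no, frame in enumerate(frames, start=1):
        # Fail loudly if a column was dropped, instead of misaligning data.
        if frame.shape[1] != len(header):
            raise ValueError(
                f"frame {frame_no} has {frame.shape[1]} columns, "
                f"expected {len(header)}"
            )
        frame = frame.copy()
        frame.columns = header
        aligned.append(frame)
    return pd.concat(aligned, ignore_index=True)


def read_fixed_columns(pdf_path, boundaries, header, pages="2-29"):
    """Hypothetical helper: read each page with fixed column boundaries."""
    import tabula  # imported lazily; requires tabula-py and a Java runtime

    frames = tabula.read_pdf(
        pdf_path,
        pages=pages,
        stream=True,            # whitespace-based extraction
        guess=False,            # do not auto-detect the table area per page
        columns=boundaries,     # assumed X coordinates of column boundaries
        pandas_options={"header": None},  # do not treat any row as a header
        multiple_tables=True,   # return a list of DataFrames
    )
    return stack_pages(frames, header)
```

Because no row is treated as a header here, the table name and header lines on page 2 would come through as ordinary rows and need to be dropped afterwards. If `stack_pages` raises, the boundaries did not preserve the empty columns on that page.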