Add a way to set areas for non-existent pages in template
Closed this issue · 4 comments
Is your feature request related to a problem? Please describe.
I need to import a bunch of pdfs, and I'm using a template to set the interesting areas.
Most of my documents are 2 pages long, but some of them have 3 pages and one is just one page long.
Apart from the first page, the others have the same structure (basically the PDFs contain tabular data that may not fit in one or two PDF pages).
Describe the solution you'd like
Ideally I want to set up a single template, with the details I need for the first page, and set an area for pages 2 and 3 (or at most copy an area with a different page attribute).
If I do so now, however, read_pdf_with_template raises a CalledProcessError (basically because it tries to invoke tabula-1.0.5-jar-with-dependencies.jar on page 3 of a 2-page document).
I had a quick look at the jar's --help output, but it seems to me there is no "ignore wrong pages" option.
I've also tried to explicitly pass the number of pages, but it seems that, when using a template, the options are copied and passed to multiple invocations of the jar, so the area the template defines for page 1 ends up applied to pages 1 and 2...
Describe alternatives you've considered
A workaround (that I don't really like) could be to prepare multiple templates, one for each "size" of the PDFs I need to import, but that also means I need to use at least one other PDF library to get the page count of each PDF and choose the "right" template.
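The multiple-template workaround described above could look roughly like the sketch below. It assumes pypdf as the "other PDF library" (any library that reports a page count would do), and the template file names are hypothetical. The third-party imports are done lazily, so the selection logic can be tried without tabula-py or pypdf installed.

```python
def choose_template(page_count, templates_by_page_count):
    """Return the template path registered for a PDF with `page_count` pages."""
    try:
        return templates_by_page_count[page_count]
    except KeyError:
        raise ValueError(
            f"no template registered for {page_count}-page PDFs"
        ) from None


def count_pages(pdf_path):
    """Count pages with pypdf (an assumed choice, not something tabula-py provides)."""
    from pypdf import PdfReader  # lazy import: only needed when a real PDF is read

    return len(PdfReader(pdf_path).pages)


def read_with_matching_template(pdf_path, templates_by_page_count):
    """Pick the template that matches the document length, then extract with it."""
    import tabula  # tabula-py

    template_path = choose_template(count_pages(pdf_path), templates_by_page_count)
    return tabula.read_pdf_with_template(pdf_path, template_path)


# Hypothetical file names: one template per document length.
TEMPLATES = {
    1: "one_page.tabula-template.json",
    2: "two_pages.tabula-template.json",
    3: "three_pages.tabula-template.json",
}
```

Calling read_with_matching_template("report.pdf", TEMPLATES) would then use whichever template matches the document's length; "report.pdf" is only a placeholder name.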
Additional context
Thanks for creating an issue.
If you want to apply the same options to all pages, I would suggest calling tabula.template.load_template directly.
Here is an example:
>>> import tabula
>>> fname = "./tests/resources/data.tabula-template.json"
>>> o = tabula.template.load_template(fname)
>>> o
[TabulaOption(pages=1, guess=False, area=[124.0, 154.0, 531.745, 565.57], relative_area=False, lattice=False, stream=True, password=None, silent=None, columns=None, relative_columns=False, format=None, batch=None, output_path=None, options='', multiple_tables=True), TabulaOption(pages=2, guess=True, area=[[123.999, 154.0, 210.444, 453.88], [410.996, 154.0, 497.441, 487.54]], relative_area=False, lattice=False, stream=False, password=None, silent=None, columns=None, relative_columns=False, format=None, batch=None, output_path=None, options='', multiple_tables=True), TabulaOption(pages=3, guess=True, area=[123.999, 154.0, 322.899, 235.855], relative_area=False, lattice=False, stream=False, password=None, silent=None, columns=None, relative_columns=False, format=None, batch=None, output_path=None, options='', multiple_tables=True)]
>>> o[0]
TabulaOption(pages=1, guess=False, area=[124.0, 154.0, 531.745, 565.57], relative_area=False, lattice=False, stream=True, password=None, silent=None, columns=None, relative_columns=False, format=None, batch=None, output_path=None, options='', multiple_tables=True)
>>> o[0].pages
1
>>> o[0].pages="all"
>>> tabula.read_pdf(pdf_path, options=" ".join(o[0].build_option_list()))
'pages' argument isn't specified.Will extract only from page 1 by default.
Got stderr: Aug. 22, 2023 9:08:52 P.M. org.apache.pdfbox.pdmodel.font.FileSystemFontProvider loadDiskCache
WARNING: New fonts found, font cache will be re-built
Aug. 22, 2023 9:08:52 P.M. org.apache.pdfbox.pdmodel.font.FileSystemFontProvider <init>
WARNING: Building on-disk font cache, this may take a while
Aug. 22, 2023 9:08:53 P.M. org.apache.pdfbox.pdmodel.font.FileSystemFontProvider <init>
WARNING: Finished building on-disk font cache, found 808 fonts
[ Unnamed: 0 mpg cyl disp hp drat wt qsec vs am gear carb
0 Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
1 Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
2 Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
3 Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
4 Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
5 Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
6 Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
7 Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
8 Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
9 Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
10 Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
11 Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
12 Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
13 Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3
14 Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
15 Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
16 Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
17 Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
18 Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
19 Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
20 Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
21 Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2
22 AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2
23 Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
24 Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
25 Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
26 Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
27 Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
28 Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
29 Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
30 Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
31 Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2, Sepal.Length Sepal.Width Petal.Length Petal.Width Species
0 5.1 3.5 1.4 0.2 setosa
1 4.9 3.0 1.4 0.2 setosa
2 4.7 3.2 1.3 0.2 setosa
3 4.6 3.1 1.5 0.2 setosa
4 5.0 3.6 1.4 0.2 setosa
5 5.4 3.9 1.7 0.4 setosa, Unnamed: 0 Sepal.Length Sepal.Width Petal.Length Petal.Width Species
0 145 6.7 3.3 5.7 2.5 virginica
1 146 6.7 3.0 5.2 2.3 virginica
2 147 6.3 2.5 5.0 1.9 virginica
3 148 6.5 3.0 5.2 2.0 virginica
4 149 6.2 3.4 5.4 2.3 virginica
5 150 5.9 3.0 5.1 1.8 virginica, len supp dose
0 4.2 VC 0.5
1 11.5 VC 0.5
2 7.3 VC 0.5
3 5.8 VC 0.5
4 6.4 VC 0.5
5 10.0 VC 0.5
6 11.2 VC 0.5
7 11.2 VC 0.5
8 5.2 VC 0.5
9 7.0 VC 0.5
10 16.5 VC 1.0
11 16.5 VC 1.0
12 15.2 VC 1.0
13 17.3 VC 1.0
14 22.5 VC 1.0]
Of course, there is room for improvement, such as allowing a TabulaOption to be passed to tabula.read_pdf directly, but before that, I'd love to hear your feedback.
Closing since there has been no response.
Uhh... sorry I didn't reply sooner, but this is a hobby project I'm working on.
While I understand your suggestion, it means that the templates are no longer defined only in the JSON file, but are explicitly manipulated in code... I think that for the moment I'll stick with multiple templates and some simple logic to choose which one to use for the extraction.
Thanks for your response.
Unfortunately, tabula-py also doesn't know the number of pages in a PDF, so we can only use the pages="all" option for handling unknown pages.
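If the caller can supply the page count from elsewhere, a single template could in principle still be reused by dropping the entries that point past the end of the document. The sketch below is only an illustration, not a tabula-py feature: the helper names are made up, page_count must come from outside tabula-py, and the extraction call simply follows the load_template / build_option_list pattern shown earlier in this thread.

```python
def options_for_existing_pages(template_options, page_count):
    """Keep only the template entries whose page exists in the document.

    Entries whose `pages` is not a single int (for example "all") are kept as-is.
    """
    return [
        opt
        for opt in template_options
        if not isinstance(opt.pages, int) or opt.pages <= page_count
    ]


def read_pdf_with_clipped_template(pdf_path, template_path, page_count):
    """Hypothetical helper: run each surviving template entry and collect the tables."""
    import tabula  # tabula-py, imported lazily so the filter above has no dependencies

    frames = []
    options = tabula.template.load_template(template_path)
    for opt in options_for_existing_pages(options, page_count):
        # Same call pattern as the example in this thread: options passed as one string.
        frames.extend(
            tabula.read_pdf(pdf_path, options=" ".join(opt.build_option_list()))
        )
    return frames
```

With a three-entry template and page_count=2, only the entries for pages 1 and 2 would be passed to the jar, so the invocation for the missing page 3 is never made.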