chezou/tabula-py

Unexpected table extraction

alexDS12 opened this issue · 5 comments

tabula-py gives a different result than tabula-java. I'm not sure whether some option is needed to work around this or whether it is an actual bug.

Summary of your issue

Different columns are being treated as a single column. "Target Production ..." and "Month under review" should be the second and third columns respectively.
I assume tabula-py is a tabula-java wrapper, yet the two produce different results.

Check list before submit

  • Did you read FAQ? Yes

  • (Optional, but really helpful) Your PDF URL: https://mopng.gov.in/files/petroleumStatistics/monthlyProduction/MPR-for-the-month-of-June,2022.pdf - page 9

  • Paste the output of import tabula; tabula.environment_info() on Python REPL: ?
    Python version:
    3.7.11 (default, Jul 27 2021, 09:42:29) [MSC v.1916 64 bit (AMD64)]
    Java version:
    openjdk version "11.0.6" 2020-01-14
    OpenJDK Runtime Environment (build 11.0.6+8-b765.1)
    OpenJDK 64-Bit Server VM (build 11.0.6+8-b765.1, mixed mode)
    tabula-py version: 2.4.0
    platform: Windows-10-10.0.22000-SP0
    uname:
    uname_result(system='Windows', node='DESKTOP-7T3U2OJ', release='10', version='10.0.22000', machine='AMD64', processor='Intel64 Family 6 Model 158 Stepping 13, GenuineIntel')
    linux_distribution: ('', '', '')
    mac_ver: ('', ('', '', ''), '')

What did you do when you faced the problem?

Tried changing to guess=False, stream=True, lattice=False, and finally checked the result in tabula-java's UI.

Code:

import tabula

url = 'https://mopng.gov.in/files/petroleumStatistics/monthlyProduction/MPR-for-the-month-of-June,2022.pdf'
dfs = tabula.read_pdf(url, pages=9, lattice=True, pandas_options={'header': None})
df = dfs[0]

Expected behavior:

The two columns should appear as the second and third columns respectively.

image

Target production during the month and Month under review * in separate columns.

Actual behavior:

Both columns are treated as one.

image

Target Production during the month and Month under review * in the same column.

Related Issues:

No related issues.

I can't confirm the "different result" you mentioned. Can you elaborate on it, especially on how you are calling tabula-java?

The major difference between the defaults of tabula-java and tabula-py is that tabula-py sets guess=True by default. Other than that, they should be equivalent.
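To check whether the guess default is what changes the result, one could run the same page under each combination of guess and extraction mode and compare the column counts. The following is only a sketch: `option_grid` and `column_counts` are hypothetical helper names, and the `reader` parameter is an assumption added so the comparison logic can be exercised without Java installed. By default it falls back to `tabula.read_pdf`.

```python
from itertools import product


def option_grid():
    """Return the guess/lattice/stream combinations to compare."""
    combos = []
    for guess, mode in product((True, False), ("lattice", "stream")):
        combos.append({
            "guess": guess,
            "lattice": mode == "lattice",
            "stream": mode == "stream",
        })
    return combos


def column_counts(path, page, reader=None):
    """Map (guess, mode) to the number of columns in the first table found."""
    if reader is None:
        import tabula  # imported lazily; needs tabula-py and a Java runtime
        reader = tabula.read_pdf
    counts = {}
    for opts in option_grid():
        dfs = reader(path, pages=page, pandas_options={"header": None}, **opts)
        mode = "lattice" if opts["lattice"] else "stream"
        # Record 0 when no table was detected for this combination.
        counts[(opts["guess"], mode)] = dfs[0].shape[1] if dfs else 0
    return counts
```

Calling `column_counts(url, 9)` on the March and June reports would show whether any combination yields the same number of columns for both files.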

Note that tabula.app is a GUI, so the options it actually applies may differ from the area or options you selected.

As far as I can tell from trying both tabula.app and tabula-py, they give the same results.

tabula.app result:
image

In [9]: fname = "~/Downloads/MPR-for-the-month-of-June,2022.pdf"

In [10]: dfs2 = tabula.read_pdf(fname, pages=9)

In [11]: dfs2[0]
Out[11]:
                Unnamed: 0                                         Unnamed: 1 Annexure-II (Para-2)
0                      NaN  Natural Gas Production during the month of Jun...                  NaN
1                      NaN                  (Figures in Million Cubic Meters)                  NaN
2                  Name of  Target Production during the Cumulative Produc...          % variation
3   Undertaking/Unit/State                       production (April-June) year           during the
4                      NaN  during the Month Corresponding Preceding Targe...          month under
5                      NaN  month under month of last month of production ...          review over
6                      NaN  review * year current year ** during current y...               Target
7                      NaN                                               year           production
8                      (1)               (2) (3) (4) (5) (6) (7) (8) (9) (10)                 (11)
9                  1. ONGC  1627.43 1637.24 1684.40 1741.33 5076.75 5086.5...            0.69 0.60
10                 Onshore  403.13 360.69 363.98 403.15 1210.74 1168.30 10...          8.57 -10.53
11          Andhra Pradesh  81.54 59.26 66.62 64.87 210.71 188.43 200.57 -...         -6.05 -27.32
12                   Assam    38.44 29.13 32.08 30.52 99.16 89.85 98.39 -9.20         -8.68 -24.22
13                 Gujarat  70.59 73.17 79.20 74.92 219.08 221.66 246.75 -...                 3.66
14               Rajasthan            0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00            0.00 0.00
15              Tamil Nadu  93.80 94.96 78.19 96.37 285.25 286.40 234.56 2...           22.10 1.23
16                 Tripura  118.76 104.17 107.89 136.47 396.55 381.95 295....         29.13 -12.29
17                Offshore  1224.29 1276.55 1320.42 1338.18 3866.01 3918.2...           -1.45 4.27
18        Eastern Offshore  19.55 18.27 29.04 19.13 57.25 55.98 91.90 -37....                -6.51
19        Western Offshore  1204.75 1258.27 1291.39 1319.05 3808.76 3862.2...           -0.56 4.44
20                  2. OIL  305.55 246.75 230.14 251.12 924.79 742.78 675....         10.01 -19.24
21                   Assam  278.46 229.62 203.76 230.98 842.62 686.15 590....         16.15 -17.54
22       Arunachal Pradesh        7.38 4.81 4.22 4.98 22.39 14.37 13.15 13.96          9.26 -34.86
23               Rajasthan  19.71 12.32 22.15 15.16 59.78 42.26 71.29 -44....               -37.51
24        3. Private / JVC  932.86 928.80 862.44 921.20 2782.31 2723.82 24...          11.59 -0.43
25                 Onshore  222.91 216.57 246.64 209.88 684.46 636.04 713....                -2.84
26          Andhra Pradesh           1.05 1.21 1.22 1.24 3.50 3.66 3.83 -0.71          -4.55 14.85
27       Arunachal Pradesh           0.33 0.40 0.41 0.41 1.02 1.20 1.24 -2.90          -2.55 19.36
28                   Assam   45.66 36.00 24.03 22.49 138.61 79.72 82.57 49.82         -3.45 -21.16
29                 Gujarat        7.20 7.26 5.49 7.95 20.02 22.99 17.12 32.36           34.33 0.91
30               Rajasthan  166.36 169.45 212.96 175.43 514.32 521.53 600....                 1.86
31              Tamil Nadu          2.30 2.26 2.54 2.37 6.98 6.94 7.47 -10.89          -7.11 -1.92
32                     CBM  59.32 56.54 56.70 57.45 178.29 169.20 172.09 -...          -1.68 -4.69
33         Jharkhand (CBM)           0.62 0.63 0.35 0.66 1.85 1.85 0.94 79.70           97.11 0.38
34    Madhya Pradesh (CBM)  23.40 22.68 24.87 22.93 70.98 68.00 76.44 -8.8...                -3.08
35       West Bengal (CBM)    35.30 33.23 31.48 33.86 105.46 99.35 94.72 5.57           4.89 -5.85
36                Offshore  650.63 655.69 559.10 653.87 1919.56 1918.58 15...           23.31 0.78
37        Eastern Offshore  618.05 635.44 545.52 638.43 1858.27 1868.64 15...           23.62 2.81
38        Gujarat Offshore     22.38 14.78 13.57 15.45 51.10 44.47 44.23 8.90          0.54 -33.96
39        Western Offshore             10.19 5.47 0.00 0.00 10.19 5.47 0.00 -             - -46.35
40           TOTAL (1+2+3)  2865.83 2812.78 2776.98 2913.65 8783.85 8553.1...           4.71 -1.85
41                     CBM  59.32 56.54 56.70 57.45 178.29 169.20 172.09 -...          -1.68 -4.69
42                 Onshore  931.59 824.01 840.76 864.15 2819.99 2547.12 24...          3.36 -11.55
43                Offshore  1874.92 1932.23 1879.52 1992.05 5785.57 5836.8...            5.51 3.06
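In the output above, the numeric columns are merged into one cell per row, with the values separated by spaces. As a stopgap, such a cell could be split after extraction. This is a minimal sketch, not part of tabula-py: `split_merged_cell` is a hypothetical helper, and it assumes every token in a data row is either a number or a placeholder such as "-". Header rows would need separate handling.

```python
def split_merged_cell(cell):
    """Split a space-separated cell into floats; non-numeric tokens become None."""
    values = []
    for token in str(cell).split():
        try:
            values.append(float(token))
        except ValueError:
            # Placeholders such as "-" carry no numeric value.
            values.append(None)
    return values
```

For example, the Assam row's merged cell `"38.44 29.13 32.08 30.52 99.16 89.85 98.39 -9.20"` would split into eight separate numbers.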

Closing since there has been no response. Feel free to reopen once you can comment in more detail.

Apologies for the delay on this.

Can you please try with lattice=True? It worked very well for the past few reports I extracted data from, but all of a sudden, for June and July, lattice mode does not split the columns.

It yields the same result I initially reported, whether guess is True or False.

I'm not sure how to give more information about my tabula-java usage: I'm using the GUI, auto-detecting the table on page 9, in lattice extraction mode.

Here's an example of a successful lattice extraction: one column lands in the second position and the other in the third, with column names "1" and "2" respectively.

url = 'https://mopng.gov.in/files/petroleumStatistics/monthlyProduction/march2022.pdf'
dfs = tabula.read_pdf(url, pages=9, lattice=True, pandas_options={'header': None})
dfs[0]

image

Whereas the following extraction puts both columns (marked in red) into the second column.

url = 'https://mopng.gov.in/files/petroleumStatistics/monthlyProduction/MPR-for-the-month-of-June,2022.pdf'
dfs = tabula.read_pdf(url, pages=9, lattice=True, pandas_options={'header': None})
dfs[0]

image

Same code snippet, but both columns (marked in red) end up in the same column (named "1").

When checking the PDF file, they are clearly different columns:
image

Hope these two examples of the odd and the expected extraction help.

Sorry, I do try to keep compatibility with tabula-java, but I can't guarantee the same output as the tabula app.

One suggestion: save a template in the tabula app and reuse it with tabula-py.

That may help you understand which options the tabula app is using. If those options can be reused in tabula-py, you should be able to extract the same result.
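To see which options a saved template implies, one could read the template file and translate each selection into tabula-py style keyword arguments. The sketch below assumes the exported template is a JSON list whose entries carry `page`, `extraction_method`, `x1`, `y1`, `x2` and `y2` fields; that layout is an assumption and should be checked against an actual export. `template_to_options` is a hypothetical helper name.

```python
import json


def template_to_options(template_text):
    """Translate template JSON text into tabula-py style option dicts."""
    options = []
    for sel in json.loads(template_text):
        method = sel.get("extraction_method", "guess")
        options.append({
            "pages": sel["page"],
            "guess": method == "guess",
            "lattice": method == "lattice",
            "stream": method == "stream",
            # tabula-py takes area as (top, left, bottom, right).
            "area": [sel["y1"], sel["x1"], sel["y2"], sel["x2"]],
        })
    return options
```

Each resulting dict could then be passed as `tabula.read_pdf(path, **opts)`. tabula-py also provides `tabula.read_pdf_with_template(input_path, template_path)`, which applies a saved template directly.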

Thanks for your help on this matter. It might be possible to run tabula-java on the command line from within a Python script. If that works, I will post the output here, as it might be helpful to others.
Thanks again.
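A rough sketch of that idea, calling the tabula-java jar through `subprocess`: the jar path is a placeholder for wherever the jar was downloaded, the PDF is assumed to be a local file rather than a URL, and both function names are hypothetical. The `--pages`, `--lattice`, `--stream`, `--guess` and `--format` flags are tabula-java command-line options.

```python
import subprocess


def build_tabula_java_command(jar_path, pdf_path, page, lattice=True, guess=False):
    """Build the argument list for a tabula-java CLI call."""
    cmd = ["java", "-jar", jar_path, "--pages", str(page), "--format", "CSV"]
    cmd.append("--lattice" if lattice else "--stream")
    if guess:
        cmd.append("--guess")
    cmd.append(pdf_path)  # the input file goes last
    return cmd


def run_tabula_java(jar_path, pdf_path, page, **kwargs):
    """Run tabula-java and return its CSV output as a string."""
    cmd = build_tabula_java_command(jar_path, pdf_path, page, **kwargs)
    # check=True raises CalledProcessError if Java exits with an error.
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout
```

The returned CSV text could be compared with the output of `tabula.read_pdf(..., lattice=True)` for the same page.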