Unexpected table extraction
alexDS12 opened this issue · 5 comments
tabula-py gives a different result than tabula-java. I'm not sure whether some option is needed to work around this or whether it is an actual bug.
Summary of your issue
Different columns are being treated as a single column. "Target Production ..." and "Month under review" should be the second and third columns respectively.
I assume tabula-py is a tabula-java wrapper, yet the two give different results.
Check list before submit
- Did you read FAQ? Yes
- (Optional, but really helpful) Your PDF URL: https://mopng.gov.in/files/petroleumStatistics/monthlyProduction/MPR-for-the-month-of-June,2022.pdf - page 9
- Paste the output of import tabula; tabula.environment_info() on Python REPL: ?
Python version:
3.7.11 (default, Jul 27 2021, 09:42:29) [MSC v.1916 64 bit (AMD64)]
Java version:
openjdk version "11.0.6" 2020-01-14
OpenJDK Runtime Environment (build 11.0.6+8-b765.1)
OpenJDK 64-Bit Server VM (build 11.0.6+8-b765.1, mixed mode)
tabula-py version: 2.4.0
platform: Windows-10-10.0.22000-SP0
uname:
uname_result(system='Windows', node='DESKTOP-7T3U2OJ', release='10', version='10.0.22000', machine='AMD64', processor='Intel64 Family 6 Model 158 Stepping 13, GenuineIntel')
linux_distribution: ('', '', '')
mac_ver: ('', ('', '', ''), '')
What did you do when you faced the problem?
Tried guess=False, stream=True, and lattice=False. Finally, checked the result in the tabula-java UI.
Code:
dfs = tabula.read_pdf('https://mopng.gov.in/files/petroleumStatistics/monthlyProduction/MPR-for-the-month-of-June,2022.pdf', pages=9, lattice=True, pandas_options={'header': None})
df = dfs[0]
Expected behavior:
The two columns appear as the second and third columns respectively.
Target production during the month and Month under review * in separate columns.
Actual behavior:
Both columns treated as one.
Target Production during the month and Month under review * in the same column.
Related Issues:
No related issues.
I can't confirm the "different result" you mentioned. Can you elaborate on it, especially on how you call tabula-java?
The major difference between the default options of tabula-java and tabula-py is that tabula-py sets guess=True by default. Other than that, they should be equivalent.
Note that tabula.app provides a GUI, so the options it actually uses can differ from the area or options you selected.
As far as I can tell from trying tabula.app and tabula-py, they give the same results.
In [9]: fname = "~/Downloads/MPR-for-the-month-of-June,2022.pdf"
In [10]: dfs2 = tabula.read_pdf(fname, pages=9)
In [11]: dfs2[0]
Out[11]:
Unnamed: 0 Unnamed: 1 Annexure-II (Para-2)
0 NaN Natural Gas Production during the month of Jun... NaN
1 NaN (Figures in Million Cubic Meters) NaN
2 Name of Target Production during the Cumulative Produc... % variation
3 Undertaking/Unit/State production (April-June) year during the
4 NaN during the Month Corresponding Preceding Targe... month under
5 NaN month under month of last month of production ... review over
6 NaN review * year current year ** during current y... Target
7 NaN year production
8 (1) (2) (3) (4) (5) (6) (7) (8) (9) (10) (11)
9 1. ONGC 1627.43 1637.24 1684.40 1741.33 5076.75 5086.5... 0.69 0.60
10 Onshore 403.13 360.69 363.98 403.15 1210.74 1168.30 10... 8.57 -10.53
11 Andhra Pradesh 81.54 59.26 66.62 64.87 210.71 188.43 200.57 -... -6.05 -27.32
12 Assam 38.44 29.13 32.08 30.52 99.16 89.85 98.39 -9.20 -8.68 -24.22
13 Gujarat 70.59 73.17 79.20 74.92 219.08 221.66 246.75 -... 3.66
14 Rajasthan 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
15 Tamil Nadu 93.80 94.96 78.19 96.37 285.25 286.40 234.56 2... 22.10 1.23
16 Tripura 118.76 104.17 107.89 136.47 396.55 381.95 295.... 29.13 -12.29
17 Offshore 1224.29 1276.55 1320.42 1338.18 3866.01 3918.2... -1.45 4.27
18 Eastern Offshore 19.55 18.27 29.04 19.13 57.25 55.98 91.90 -37.... -6.51
19 Western Offshore 1204.75 1258.27 1291.39 1319.05 3808.76 3862.2... -0.56 4.44
20 2. OIL 305.55 246.75 230.14 251.12 924.79 742.78 675.... 10.01 -19.24
21 Assam 278.46 229.62 203.76 230.98 842.62 686.15 590.... 16.15 -17.54
22 Arunachal Pradesh 7.38 4.81 4.22 4.98 22.39 14.37 13.15 13.96 9.26 -34.86
23 Rajasthan 19.71 12.32 22.15 15.16 59.78 42.26 71.29 -44.... -37.51
24 3. Private / JVC 932.86 928.80 862.44 921.20 2782.31 2723.82 24... 11.59 -0.43
25 Onshore 222.91 216.57 246.64 209.88 684.46 636.04 713.... -2.84
26 Andhra Pradesh 1.05 1.21 1.22 1.24 3.50 3.66 3.83 -0.71 -4.55 14.85
27 Arunachal Pradesh 0.33 0.40 0.41 0.41 1.02 1.20 1.24 -2.90 -2.55 19.36
28 Assam 45.66 36.00 24.03 22.49 138.61 79.72 82.57 49.82 -3.45 -21.16
29 Gujarat 7.20 7.26 5.49 7.95 20.02 22.99 17.12 32.36 34.33 0.91
30 Rajasthan 166.36 169.45 212.96 175.43 514.32 521.53 600.... 1.86
31 Tamil Nadu 2.30 2.26 2.54 2.37 6.98 6.94 7.47 -10.89 -7.11 -1.92
32 CBM 59.32 56.54 56.70 57.45 178.29 169.20 172.09 -... -1.68 -4.69
33 Jharkhand (CBM) 0.62 0.63 0.35 0.66 1.85 1.85 0.94 79.70 97.11 0.38
34 Madhya Pradesh (CBM) 23.40 22.68 24.87 22.93 70.98 68.00 76.44 -8.8... -3.08
35 West Bengal (CBM) 35.30 33.23 31.48 33.86 105.46 99.35 94.72 5.57 4.89 -5.85
36 Offshore 650.63 655.69 559.10 653.87 1919.56 1918.58 15... 23.31 0.78
37 Eastern Offshore 618.05 635.44 545.52 638.43 1858.27 1868.64 15... 23.62 2.81
38 Gujarat Offshore 22.38 14.78 13.57 15.45 51.10 44.47 44.23 8.90 0.54 -33.96
39 Western Offshore 10.19 5.47 0.00 0.00 10.19 5.47 0.00 - - -46.35
40 TOTAL (1+2+3) 2865.83 2812.78 2776.98 2913.65 8783.85 8553.1... 4.71 -1.85
41 CBM 59.32 56.54 56.70 57.45 178.29 169.20 172.09 -... -1.68 -4.69
42 Onshore 931.59 824.01 840.76 864.15 2819.99 2547.12 24... 3.36 -11.55
43 Offshore 1874.92 1932.23 1879.52 1992.05 5785.57 5836.8... 5.51 3.06
Closing since there has been no response. Feel free to reopen when you can comment in more detail.
Apologies for the delay on this.
Can you please try with lattice=True? It worked very well for the past few reports I extracted data from, but all of a sudden, for June and July, lattice mode does not split the columns.
It yields the same result as I initially reported, whether using guess=True or False.
I'm not sure how to give more information about tabula-java: I'm using the GUI, auto-detecting the table on page 9, and using lattice extraction mode.
Here's an example of a successful lattice extraction. One column lands in the second position and the other in the third, with column names "1" and "2" respectively.
url = 'https://mopng.gov.in/files/petroleumStatistics/monthlyProduction/march2022.pdf'
dfs = tabula.read_pdf(url, pages=9, lattice=True, pandas_options={'header': None})
dfs[0]
Whereas the following extraction puts both of the columns marked in red into the second column.
url = 'https://mopng.gov.in/files/petroleumStatistics/monthlyProduction/MPR-for-the-month-of-June,2022.pdf'
dfs = tabula.read_pdf(url, pages=9, lattice=True, pandas_options={'header': None})
dfs[0]
Same code snippet, but both columns marked in red end up in the same column (named "1").
In the PDF file itself, those are separate columns.
Hope these two examples of odd and expected extractions help.
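For anyone wanting to check a batch of reports for this merge automatically, here is a minimal sketch. It assumes a merged cell shows up as one string holding two or more decimal numbers; the exact separator tabula emits is not confirmed in this thread. The helper name merged_cells and the sample rows are hypothetical; only the numeric values come from the ONGC and Assam rows above.

```python
import re

# Decimal numbers such as "1627.43" or "-10.53". Requiring a decimal point
# keeps labels like "1. ONGC" or "TOTAL (1+2+3)" from being counted.
DECIMAL = re.compile(r"-?\d+\.\d+")


def merged_cells(rows):
    """Return (row, column) positions of cells that hold more than one decimal number."""
    hits = []
    for r, row in enumerate(rows):
        for c, cell in enumerate(row):
            if isinstance(cell, str) and len(DECIMAL.findall(cell)) > 1:
                hits.append((r, c))
    return hits


# Hypothetical rows shaped like dfs[0].values.tolist(); the merged form of
# the first data cell is an assumption made for illustration.
rows = [
    ["1. ONGC", "1627.43 1637.24", "1684.40"],
    ["Assam", "38.44", "29.13"],
]
print(merged_cells(rows))
```

An empty result for the March report and a non-empty one for June would confirm the difference without comparing screenshots.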
Sorry, I do try to keep compatibility with tabula-java, but I can't guarantee the same output as the tabula app.
One suggestion: save a template in the tabula app and reuse it with tabula-py.
That may help you understand which options the tabula app is using. If those options can be reused in tabula-py, you can extract the same result.
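To see which options a saved template carries, a rough sketch like the following may help. It assumes the template is a JSON list of selections with the keys page, extraction_method, x1, y1, x2 and y2; that layout is my understanding of the file the tabula app exports, so check it against your own file. The function name template_to_options and the coordinates in the example are hypothetical.

```python
import json


def template_to_options(template_text):
    """Translate each selection in a Tabula template into read_pdf-style keyword arguments."""
    options = []
    for selection in json.loads(template_text):
        opts = {"pages": selection["page"]}
        method = selection.get("extraction_method")
        if method in ("guess", "lattice", "stream"):
            opts[method] = True
        # tabula-py's `area` is ordered top, left, bottom, right.
        opts["area"] = [
            selection["y1"],
            selection["x1"],
            selection["y2"],
            selection["x2"],
        ]
        options.append(opts)
    return options


# Hypothetical template contents; the coordinates are made up for illustration.
example = (
    '[{"page": 9, "extraction_method": "lattice", '
    '"x1": 30.0, "y1": 80.0, "x2": 560.0, "y2": 760.0}]'
)
for opts in template_to_options(example):
    print(opts)
```

tabula-py also provides tabula.read_pdf_with_template, which applies a template file directly. The sketch above is only meant to make the options inside the template visible.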
Thanks for your help on this matter. It might be possible to run tabula-java on the command line from within a Python script; if that works, I will post the output here, as it might be helpful to others.
Thanks again.
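A minimal sketch of that command-line approach, for reference. It assumes Java is on the PATH, that a tabula-java jar has been downloaded (the name tabula.jar is a placeholder), and that the PDF has been saved locally first. The function names are hypothetical; the flags used are tabula-java's --pages, --format, --lattice and --stream.

```python
import subprocess


def tabula_java_command(jar_path, pdf_path, page, lattice=True):
    """Build the argument list for a tabula-java command-line run."""
    cmd = ["java", "-jar", jar_path, "--pages", str(page), "--format", "CSV"]
    cmd.append("--lattice" if lattice else "--stream")
    cmd.append(pdf_path)
    return cmd


def run_tabula_java(jar_path, pdf_path, page, lattice=True):
    """Run tabula-java and return the CSV text it writes to stdout."""
    result = subprocess.run(
        tabula_java_command(jar_path, pdf_path, page, lattice),
        capture_output=True,  # needs Python 3.7 or newer
        text=True,
        check=True,  # raise CalledProcessError if java exits with an error
    )
    return result.stdout


# Placeholder jar name; the PDF is assumed to be in the working directory.
cmd = tabula_java_command("tabula.jar", "MPR-for-the-month-of-June,2022.pdf", 9)
print(" ".join(cmd))
```

Calling run_tabula_java with the same arguments would return the CSV text, which can then be compared against the tabula-py output for the same page.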