Unexpected table extraction

Question

Closed this issue 2 years ago · 3 comments

Summary of your issue

Two columns are wrongly parsed as one; they should be extracted as separate columns.

I assume tabula-py is just a wrapper for tabula-java, correct? Yet I'm getting different table columns when comparing the tabula-py output with the tabula-java UI results.

Check list before

Did you read FAQ?
(Optional, but really helpful) Your PDF URL: https://mopng.gov.in/files/petroleumStatistics/monthlyProduction/MPR-for-the-month-of-June,2022.pdf
Paste the output of import tabula; tabula.environment_info() on Python REPL:
Python version:
3.7.11 (default, Jul 27 2021, 09:42:29) [MSC v.1916 64 bit (AMD64)]
Java version:
openjdk version "11.0.6" 2020-01-14
OpenJDK Runtime Environment (build 11.0.6+8-b765.1)
OpenJDK 64-Bit Server VM (build 11.0.6+8-b765.1, mixed mode)
tabula-py version: 2.4.0
platform: Windows-10-10.0.22000-SP0
uname:
uname_result(system='Windows', node='DESKTOP-7T3U2OJ', release='10', version='10.0.22000', machine='AMD64', processor='Intel64 Family 6 Model 158 Stepping 13, GenuineIntel')
linux_distribution: ('', '', '')
mac_ver: ('', ('', '', ''), '')

What did you do when you faced the problem?

Checked result in the tabula-java UI.

Code:

import tabula

# Extract the tables on page 9 in lattice mode, without treating any row as a header
dfs = tabula.read_pdf('https://mopng.gov.in/files/petroleumStatistics/monthlyProduction/MPR-for-the-month-of-June,2022.pdf', pages=9, lattice=True, pandas_options={'header': None})
df = dfs[0]

Expected behavior:

"Target production during the month" as the second column and "Month under review *" as the third.

Actual behavior:

Both "Target production during the month" and "Month under review *" end up in the second column.
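
One possible post-processing workaround, sketched under assumptions not confirmed in this thread: if each merged cell comes back as a single string holding the two values separated by whitespace, the column can be split after extraction. The helper names `split_merged_cell` and `split_merged_column` and the sample row are hypothetical, and the real cell contents on page 9 may look different.

```python
def split_merged_cell(cell):
    """Split a cell holding two whitespace-separated values into a pair.

    Assumption: a merged cell looks like "<value> <value>". Anything else
    (non-strings such as NaN, or strings with a different token count)
    is returned unchanged with None as the second value.
    """
    if not isinstance(cell, str):
        return (cell, None)
    parts = cell.split()
    if len(parts) == 2:
        return (parts[0], parts[1])
    return (cell, None)


def split_merged_column(rows, col_index):
    """Return new rows where the column at col_index is replaced by two columns."""
    fixed = []
    for row in rows:
        left, right = split_merged_cell(row[col_index])
        fixed.append(list(row[:col_index]) + [left, right] + list(row[col_index + 1:]))
    return fixed


# Hypothetical sample row, not taken from the PDF
sample = [["Row label", "10.5 9.8", "other"]]
print(split_merged_column(sample, 1))
```

With a DataFrame, the rows could be obtained via `df.values.tolist()` and the result passed back to `pandas.DataFrame`. This only patches the output; it does not explain why the tabula-java UI and tabula-py detect different columns.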

Related Issues:

None

Answer 1 · 2022-08-01T16:18:19.000Z

@alexDS12 this issue was automatically closed because it did not follow the issue template

Answer 2 · 2022-09-12T07:43:13.000Z

@alexDS12 Were you able to solve the problem? Please help. I am also facing a similar issue.

Answer 3 · 2022-09-21T21:07:10.000Z

@ollycredit I chose to move to another library, as I didn't find a reliable solution. Please create a new issue so the author can try to help you.