Unexpected table extraction
Closed this issue · 3 comments
Summary of your issue
tabula-py parses two separate columns as a single column.
I assume tabula-py is just a wrapper for tabula-java, correct? Yet I get different table columns when I compare the tabula-py output with the tabula-java UI results.
Check list before
- Did you read FAQ?
- (Optional, but really helpful) Your PDF URL: https://mopng.gov.in/files/petroleumStatistics/monthlyProduction/MPR-for-the-month-of-June,2022.pdf
- Paste the output of import tabula; tabula.environment_info() on Python REPL:
Python version:
3.7.11 (default, Jul 27 2021, 09:42:29) [MSC v.1916 64 bit (AMD64)]
Java version:
openjdk version "11.0.6" 2020-01-14
OpenJDK Runtime Environment (build 11.0.6+8-b765.1)
OpenJDK 64-Bit Server VM (build 11.0.6+8-b765.1, mixed mode)
tabula-py version: 2.4.0
platform: Windows-10-10.0.22000-SP0
uname:
uname_result(system='Windows', node='DESKTOP-7T3U2OJ', release='10', version='10.0.22000', machine='AMD64', processor='Intel64 Family 6 Model 158 Stepping 13, GenuineIntel')
linux_distribution: ('', '', '')
mac_ver: ('', ('', '', ''), '')
What did you do when you faced the problem?
Checked result in the tabula-java UI.
Code:
import tabula
# Read page 9 in lattice mode; header=None makes pandas treat the first row as data.
dfs = tabula.read_pdf('https://mopng.gov.in/files/petroleumStatistics/monthlyProduction/MPR-for-the-month-of-June,2022.pdf', pages=9, lattice=True, pandas_options={'header': None})
df = dfs[0]
Expected behavior:
"Target production during the month" as the second column and "Month under review *" as the third.
Actual behavior:
Both "Target production during the month" and "Month under review *" end up in the second column.
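One way to narrow this down is to run the same page through several extraction settings and compare how many columns each one returns. The sketch below is only an illustration: the helper names candidate_options and column_counts are hypothetical, and it assumes the mismatch with the UI comes from the extraction mode, which the report does not confirm. Only tabula.read_pdf and its keyword arguments (pages, lattice, stream, guess, pandas_options) are taken from tabula-py.

```python
# Hypothetical helpers for comparing extraction settings on one page.
# Assumption: the merged column may depend on lattice vs. stream mode.


def candidate_options(page):
    """Return the read_pdf settings to compare for a single page."""
    return [
        {"pages": page, "lattice": True},                 # ruling-line based, as in the report
        {"pages": page, "stream": True},                  # whitespace based
        {"pages": page, "stream": True, "guess": False},  # whitespace based, no area guessing
    ]


def column_counts(pdf_path, page):
    """Run each setting and record the column count of the first table found."""
    import tabula  # imported lazily: requires tabula-py and a Java runtime

    counts = {}
    for opts in candidate_options(page):
        dfs = tabula.read_pdf(pdf_path, pandas_options={"header": None}, **opts)
        label = ", ".join(f"{k}={v}" for k, v in opts.items() if k != "pages")
        counts[label] = dfs[0].shape[1] if dfs else 0
    return counts
```

Calling column_counts with the PDF URL above and page 9 would show whether any of these settings separates the two columns. If none does, passing explicit column boundaries through the columns argument of read_pdf in stream mode may be worth trying.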
Related Issues:
None
@alexDS12 This issue was automatically closed because it did not follow the issue template.
@alexDS12 Were you able to solve the problem? Please help. I am also facing a similar issue.
@ollycredit I didn't find a reliable solution, so I chose to move to another library. Please create a new issue so the author can try to help you.
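For anyone who stays with tabula-py, a possible stopgap is to split the merged cell after extraction. This is a minimal sketch, not a fix for the parsing itself. It assumes the two values arrive in one cell separated by whitespace, which the thread does not confirm, and split_merged_column is a hypothetical helper that works on plain row lists (for example from df.values.tolist()).

```python
def split_merged_column(rows, col, sep=None):
    """Split cell `col` of every row into two cells at the last separator.

    Assumption: the merged cell holds two values separated by `sep`
    (any whitespace when sep is None). Cells that cannot be split keep
    their value on the left and get None on the right, so every row
    still ends up with the same number of cells.
    """
    fixed = []
    for row in rows:
        cell = row[col]
        # Only strings can be split; NaN or numeric cells pass through.
        parts = cell.rsplit(sep, 1) if isinstance(cell, str) else [cell]
        if len(parts) == 2:
            left, right = parts
        else:
            left, right = cell, None
        fixed.append(list(row[:col]) + [left, right] + list(row[col + 1:]))
    return fixed
```

Splitting at the last separator keeps any multi-word text in the left-hand cell. That only gives the right result when the right-hand value is a single token, so check the output against the PDF.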