Unable to detect table with longer header information
Closed this issue · 4 comments
Summary
Reading page with wide header info
Did you read the FAQ?
- I have read the FAQ
Did you search GitHub Discussions?
- I have searched the discussions
(Optional) PDF URL
https://hints.cancer.gov/dataset/HINTS5_Cycle4_STATA_20230618.zip
About your environment
Note about link: This is a link to the public version of the file because I'm using restricted data. It's the codebook file.
Result of tabula.environment_info():
Python version:
3.11.8 | packaged by Anaconda, Inc. | (main, Feb 26 2024, 21:34:05) [MSC v.1916 64 bit (AMD64)]
Java version:
openjdk version "11.0.13" 2021-10-19
OpenJDK Runtime Environment JBR-11.0.13.7-1751.21-jcef (build 11.0.13+7-b1751.21)
OpenJDK 64-Bit Server VM JBR-11.0.13.7-1751.21-jcef (build 11.0.13+7-b1751.21, mixed mode)
tabula-py version: 2.7.0
platform: Windows-10-10.0.19045-SP0
uname:
uname_result(system='Windows', node='V-TanLab-DS', release='10', version='10.0.19045', machine='AMD64')
linux_distribution: ('', '', '')
mac_ver: ('', ('', '', ''), '')
What did you do when you faced the problem?
I am trying to read a ton of tables from a really long pdf. Most pages have some header text before a table section. Like this:
The reader is correctly skipping the header on this page and creating a table with just the table section of the page and ignoring the other content:
When it gets further down to this page:
It thinks that the header is another table, so I end up with two tables:
and
The only thing I can think of is that this header row is really wide, so it thinks it's a table as well. Or because it's so wide the top line goes onto multiple lines maybe?
Additionally, I can't figure out why the following table from the same page is also messed up. It has 21 instead of the expected 7 columns. I'm just assuming that these issues are related and that if I fix one the other will also be fixed, but I'm not entirely sure.
I have tried many different settings to make it ignore that header section, but they only broke the other tables without fixing the one I want fixed. Some examples: passing header=None through the pandas options (I also thought of passing the column names that way, but they differ on every page, so I didn't know how to use that); setting the lattice argument to True and then to False; and using another package to get the number of pages in the pdf, then looping over the pages to read each table individually, with multiple_tables set to True and then to False. There are probably more that I'm not thinking of.
Code
Here is the code I'm using. I mentioned above as well that I tried many other options besides this though.
import pandas as pd
from tabula import read_pdf
dfs = read_pdf('HINTS 5 Cycle 4 Restricted Codebook.pdf', pages='all')
Expected behavior
I expect the same behavior as shown before, but I will paste it again in this section:
The reader is correctly skipping the header on this page and creating a table with just the table section of the page and ignoring the other content:
Actual behavior
I showed the actual behavior above, but I will paste it here as well:
It thinks that the header is another table, so I end up with two tables:
and
Related issues
No response
@dstone42 Next time, could you point to a specific PDF and page?
Looking at the result you've shared, it contains the first 7 lines as another table. This comes from tabula-java's table detection algorithm, and the only way to avoid it would be setting the area option. PDFs generally have no table notation, so some detection failures can happen.
>>> dfs = tabula.read_pdf("HINTS 5 Cycle 4 Public Codebook.pdf", pages=27)
>>> dfs[0]
IGHSPANLI: High linguistically isolated strata (‘Census tracts in which 30% of the households have no adults over the age of 14 that report speaking English
0 ery well’)
1 ariable Name: HIGHSPANLI
2 ariable Label: High linguistically isolated st...
3 ariable Format: HIGHSPAN
4 riteria to receive Question: N/A
5 riteria description: N/A
6 ack to Table of Contents
7 NaN
8 NaN
9 NaN
>>> dfs[1]
HIGHSPANLI Value\rLabel Unweighted\rSample\rSize ... Unnamed: 11 Unnamed: 12 Unnamed: 13
0 NaN NaN NaN ... NaN NaN NaN
1 NaN NaN NaN ... NaN NaN NaN
2 NaN HIGHSPANLI Label ... NaN NaN NaN
3 NaN 1 NaN ... NaN 7.6 NaN
4 NaN 2 NaN ... NaN 92.4 NaN
[5 rows x 21 columns]
When I tried the stream=True option, it somewhat ignored the first 7 lines.
>>> dfs = tabula.read_pdf("HINTS 5 Cycle 4 Public Codebook.pdf", pages=27, stream=True)
>>> dfs[0]
Unnamed: 0 Unnamed: 1 Unnamed: 2 Unnamed: 3 Cumulative Weighted Unnamed: 4
0 NaN NaN Unweighted NaN Unweighted Sample Weighted
1 NaN Value Sample Unweighted Sample Size Percent
2 HIGHSPANLI Label Size Percent Size (Estimated) (Estimated)
3 1 Yes 347 9 347 19,266,519 7.6
4 2 No 3,518 91 3,865 234,548,678 92.4
Anyway, it is a tabula-java limitation, so I don't know how to avoid it other than using the area option page by page.
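For readers who want to try the area route, here is a minimal sketch. It assumes the header block takes up roughly the top 30% of the page, which is a guess to tune and not a value measured from this codebook. `relative_area_below`, `read_page_below_header`, and `header_fraction` are hypothetical names; `area` (given as top, left, bottom, right) and `relative_area` are existing `tabula.read_pdf` parameters.

```python
def relative_area_below(header_fraction):
    """Return [top, left, bottom, right] in percent, skipping the top of the page.

    header_fraction is the share of the page height (0 to 1) assumed to be
    taken by the header text. It is an assumption to tune per page.
    """
    if not 0 <= header_fraction < 1:
        raise ValueError("header_fraction must be in [0, 1)")
    top = round(header_fraction * 100, 2)
    return [top, 0, 100, 100]


def read_page_below_header(pdf_path, page, header_fraction=0.3):
    """Read one page, ignoring everything above the assumed header boundary."""
    import tabula  # imported lazily so the helper above works without Java

    return tabula.read_pdf(
        pdf_path,
        pages=page,
        area=relative_area_below(header_fraction),
        relative_area=True,  # interpret the area values as percentages of the page
        stream=True,
    )
```

Since the header height differs from page to page, `header_fraction` would likely need a per-page value, which is why this stays a last resort.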
I'm sorry I wasn't very clear about the pdf. The issue form didn't have a place to put a note next to the pdf link, so I put it at the start of the next section. I will include the page number next time as well.
I will try the stream option. I don't understand exactly what it's doing, but I'll let you know if that helps. That last table looks like exactly what I want.
Thanks!
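On what the option does: as far as I understand the tabula-py documentation, lattice=True forces extraction based on ruling lines between cells, while stream=True forces extraction based on the whitespace between columns of text. One way to see which suits a given page is to run both and compare the shapes of the results. The sketch below does that. `compare_modes` and its `reader` parameter are hypothetical names, and `reader` exists only so the function can be tried without Java or a PDF.

```python
def compare_modes(pdf_path, page, reader=None):
    """Return {mode: [(rows, cols), ...]} for lattice and stream extraction of one page."""
    if reader is None:
        import tabula  # imported lazily; needs Java at call time

        reader = tabula.read_pdf

    shapes = {}
    for mode in ("lattice", "stream"):
        # Pass lattice=True or stream=True, one mode per call.
        tables = reader(pdf_path, pages=page, **{mode: True})
        shapes[mode] = [tuple(t.shape) for t in tables]
    return shapes
```

Calling `compare_modes("HINTS 5 Cycle 4 Public Codebook.pdf", 27)` would show at a glance how many tables each mode finds and how many columns each table has.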
Actually, I only tweaked parameters at random, and I don't know why the stream option works well here. Anyway, it's not a bug but tabula-java's behavior. Setting an explicit area would be the last resort.
Other than that, there is nothing more I can help with, unfortunately.
I tried using the stream=True option for the whole pdf (meaning I used pages='all'), and it seems to work for the other tables as well. I haven't checked all 557, but with some decent spot checking, it looks like it worked. Thanks for your help!
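The spot check could be partly automated. A minimal sketch, assuming every table in the codebook should have 7 columns (that holds for the table discussed here, but it is an assumption for the rest of the file). `find_unexpected_tables` and `check_codebook` are hypothetical helpers, not part of tabula-py.

```python
def find_unexpected_tables(tables, expected_cols=7):
    """Return (index, shape) for every table whose column count differs from expected_cols."""
    return [
        (i, tuple(t.shape))
        for i, t in enumerate(tables)
        if t.shape[1] != expected_cols
    ]


def check_codebook(pdf_path, expected_cols=7):
    """Read all pages in stream mode and report tables that look wrong."""
    import tabula  # imported lazily; needs Java at call time

    dfs = tabula.read_pdf(pdf_path, pages="all", stream=True)
    return find_unexpected_tables(dfs, expected_cols)
```

Any index returned by `check_codebook` points at a table worth inspecting by hand, for example one where the header text was picked up as a separate table again.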