chezou/tabula-py

Unable to detect table with longer header information

Closed this issue · 4 comments

Summary

Reading page with wide header info

Did you read the FAQ?

  • I have read the FAQ

Did you search GitHub Discussions?

  • I have searched the discussions

(Optional) PDF URL

https://hints.cancer.gov/dataset/HINTS5_Cycle4_STATA_20230618.zip

About your environment

Note about link: This is a link to the public version of the file because I'm using restricted data. It's the codebook file.

Result of tabula.environment_info():
Python version:
    3.11.8 | packaged by Anaconda, Inc. | (main, Feb 26 2024, 21:34:05) [MSC v.1916 64 bit (AMD64)]
Java version:
    openjdk version "11.0.13" 2021-10-19
OpenJDK Runtime Environment JBR-11.0.13.7-1751.21-jcef (build 11.0.13+7-b1751.21)
OpenJDK 64-Bit Server VM JBR-11.0.13.7-1751.21-jcef (build 11.0.13+7-b1751.21, mixed mode)
tabula-py version: 2.7.0
platform: Windows-10-10.0.19045-SP0
uname:
    uname_result(system='Windows', node='V-TanLab-DS', release='10', version='10.0.19045', machine='AMD64')
linux_distribution: ('', '', '')
mac_ver: ('', ('', '', ''), '')

What did you do when you faced the problem?

I am trying to read a ton of tables from a really long pdf. Most pages have some header text before a table section. Like this:
image
The reader is correctly skipping the header on this page and creating a table with just the table section of the page and ignoring the other content:

image

When it gets further down to this page:
image
It thinks that the header is another table, so I end up with two tables:

image

and

image

The only thing I can think of is that this header row is really wide, so it gets detected as a table as well. Or maybe it's because the top line is so wide that it wraps onto multiple lines?

Additionally, I can't figure out why the following table from the same page is also messed up. It has 21 instead of the expected 7 columns. I'm just assuming that these issues are related and that if I fix one the other will also be fixed, but I'm not entirely sure.

I have tried many different settings to make it ignore that header section, but they only messed up the other tables without fixing the one I want fixed. Some examples: passing header=None through the pandas options (I thought of passing the column names as well, but they differ on every page, so I didn't know how to use that); setting the lattice argument to True and to False; and using another package to get the number of pages in the PDF, then looping through each page to extract its table individually, with multiple_tables set to True once and False once. There are probably more that I'm not thinking of.

Code

Here is the code I'm using. I mentioned above as well that I tried many other options besides this though.

import pandas as pd
from tabula import read_pdf
dfs = read_pdf('HINTS 5 Cycle 4 Restricted Codebook.pdf', pages='all')

Expected behavior

I expect the same behavior as shown before, but I will paste it again in this section:
image
The reader is correctly skipping the header on this page and creating a table with just the table section of the page and ignoring the other content:

image

Actual behavior

I showed the actual behavior above, but I will paste it here as well:
image
It thinks that the header is another table, so I end up with two tables:

image

and

image

Related issues

No response

@dstone42 Next time, could you point to a specific PDF and page?

Looking at the result you've shared, it contains the first 7 lines as another table. This comes from tabula-java's table detection algorithm, and the only way to avoid it would be to set the area option. PDFs generally don't have table notation, so some detection failures can happen.

>>> dfs = tabula.read_pdf("HINTS 5 Cycle 4 Public Codebook.pdf", pages=27)
>>> dfs[0]
  IGHSPANLI: High linguistically isolated strata (‘Census tracts in which 30% of the households have no adults over the age of 14 that report speaking English
0                                         ery well’)
1                           ariable Name: HIGHSPANLI
2  ariable Label: High linguistically isolated st...
3                           ariable Format: HIGHSPAN
4                   riteria to receive Question: N/A
5                           riteria description: N/A
6                           ack to Table of Contents
7                                                NaN
8                                                NaN
9                                                NaN

>>> dfs[1]
   HIGHSPANLI Value\rLabel Unweighted\rSample\rSize  ... Unnamed: 11 Unnamed: 12 Unnamed: 13
0         NaN          NaN                      NaN  ...         NaN         NaN         NaN
1         NaN          NaN                      NaN  ...         NaN         NaN         NaN
2         NaN   HIGHSPANLI                    Label  ...         NaN         NaN         NaN
3         NaN            1                      NaN  ...         NaN         7.6         NaN
4         NaN            2                      NaN  ...         NaN        92.4         NaN

[5 rows x 21 columns]

When I tried the stream=True option, it mostly ignored the first 7 lines.

>>> dfs = tabula.read_pdf("HINTS 5 Cycle 4 Public Codebook.pdf", pages=27, stream=True)
>>> dfs[0]
   Unnamed: 0 Unnamed: 1  Unnamed: 2  Unnamed: 3  Cumulative     Weighted   Unnamed: 4
0         NaN        NaN  Unweighted         NaN  Unweighted       Sample     Weighted
1         NaN      Value      Sample  Unweighted      Sample         Size      Percent
2  HIGHSPANLI      Label        Size     Percent        Size  (Estimated)  (Estimated)
3           1        Yes         347           9         347   19,266,519          7.6
4           2         No       3,518          91       3,865  234,548,678         92.4

Anyway, it is a tabula-java limitation, so I don't know how to avoid it other than using the area option page by page.
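For reference, the area option of read_pdf takes page coordinates in points as [top, left, bottom, right]. Below is a minimal sketch, assuming a US Letter page (612 x 792 points) and a header that ends 200 points from the top. Both numbers are placeholders that would need to be measured for each page, and area_below_header and read_table_below_header are illustrative helper names, not part of tabula-py.

```python
def area_below_header(page_height_pt, page_width_pt, header_bottom_pt):
    """Build a [top, left, bottom, right] box (in points) covering
    everything on the page below the header text."""
    if not 0 <= header_bottom_pt < page_height_pt:
        raise ValueError("header_bottom_pt must lie inside the page")
    return [header_bottom_pt, 0.0, page_height_pt, page_width_pt]


def read_table_below_header(pdf_path, page, header_bottom_pt,
                            page_height_pt=792.0, page_width_pt=612.0):
    """Read only the region below the header on a single page.

    The default page size is an assumption (US Letter); check the real
    page size of the PDF before relying on it.
    """
    # Imported lazily so the geometry helper works without Java installed.
    import tabula

    area = area_below_header(page_height_pt, page_width_pt, header_bottom_pt)
    return tabula.read_pdf(pdf_path, pages=page, area=area)


# Hypothetical value: the header is assumed to end 200 pt from the top.
example_area = area_below_header(792.0, 612.0, 200.0)
```

Because the header height differs between pages in this codebook, header_bottom_pt would have to be chosen per page, which is why this remains a last resort.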

I'm sorry I wasn't very clear about the PDF. The issue form didn't have a place to put a note next to the PDF link, so I put it in the next section, right after the link. I will include the page number next time as well.

I will try the stream option. I don't understand exactly what it's doing, but I'll let you know if that helps. That last table looks like exactly what I want.
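One thing to note about the stream=True result above is that the column headers are still spread over the first three rows. A rough sketch of how those rows could be folded into a single header with pandas, assuming the number of header rows is known (three here); merge_header_rows is a hypothetical helper, and the sample frame is typed in by hand from the output shown earlier:

```python
import pandas as pd


def merge_header_rows(df, n_header_rows):
    """Fold the first n_header_rows rows into the column names.

    Auto-generated "Unnamed: N" column names are dropped; real column
    names are kept as the first word of the merged header.
    """
    header_block = df.iloc[:n_header_rows]
    new_columns = []
    for col in df.columns:
        parts = []
        if not str(col).startswith("Unnamed:"):
            parts.append(str(col))
        # Add every non-missing header fragment, top to bottom.
        parts.extend(str(v) for v in header_block[col] if pd.notna(v))
        new_columns.append(" ".join(parts))
    body = df.iloc[n_header_rows:].reset_index(drop=True)
    body.columns = new_columns
    return body


# Hand-copied from the stream=True output for page 27.
sample = pd.DataFrame(
    [
        [None, None, "Unweighted", None, "Unweighted", "Sample", "Weighted"],
        [None, "Value", "Sample", "Unweighted", "Sample", "Size", "Percent"],
        ["HIGHSPANLI", "Label", "Size", "Percent", "Size", "(Estimated)", "(Estimated)"],
        ["1", "Yes", "347", "9", "347", "19,266,519", "7.6"],
        ["2", "No", "3,518", "91", "3,865", "234,548,678", "92.4"],
    ],
    columns=["Unnamed: 0", "Unnamed: 1", "Unnamed: 2", "Unnamed: 3",
             "Cumulative", "Weighted", "Unnamed: 4"],
)

cleaned = merge_header_rows(sample, n_header_rows=3)
```

Other pages may use a different number of header rows, so n_header_rows is left as a parameter rather than hard-coded.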

Thanks!

Actually, I just did some random tweaking of the parameters, and I don't know why the stream option works well. Anyway, it's not a bug but tabula-java's behavior. Setting an explicit area would be the last resort.

Other than that, I can't offer any further help, unfortunately.

I tried using the stream=True option for the whole pdf (meaning I used pages='all'), and it seems to work for the other tables as well. I haven't checked all 557, but with some decent spot checking, it looks like it worked. Thanks for your help!
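For the spot checking, one option is to flag every extracted table whose column count differs from what is expected, since the broken table earlier had 21 columns instead of 7. A minimal sketch, assuming most tables in the codebook should have 7 columns (an assumption based on the one page discussed here); find_suspicious_tables is a hypothetical helper, and the two empty frames only stand in for real read_pdf results:

```python
import pandas as pd


def find_suspicious_tables(dfs, expected_columns):
    """Return (index, column_count) for each table whose number of
    columns differs from expected_columns."""
    return [
        (i, df.shape[1])
        for i, df in enumerate(dfs)
        if df.shape[1] != expected_columns
    ]


# Stand-ins for extracted tables: one with 7 columns, one with 21.
good = pd.DataFrame(columns=range(7))
bad = pd.DataFrame(columns=range(21))

flagged = find_suspicious_tables([good, bad, good], expected_columns=7)
```

With the real data, dfs would be the list returned by read_pdf with pages='all' and stream=True, and only the flagged indices would need a manual look.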