chezou/tabula-py

Tabula py not reading all rows for PDFs with alternating colors for each row when Lattice is set to True

joeanton719 opened this issue · 1 comments

Summary of your issue

example.pdf

I am trying to extract all rows from the attached PDF. The output contains only the rows with a grey background; the rows with a white background are missing. How do I get all rows regardless of their background color?

Note: Initially I tried stream=True, but that caused other problems: each text line comes out as a separate row, and it is impossible to group the rows as needed. Hence, I set lattice=True. The result is the same whether or not multiple_tables is enabled.
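For the stream-mode problem mentioned in the note, one possible post-processing step is to merge continuation lines back into the row they belong to. The following is only a sketch under a strong assumption that may not hold for this PDF: a new logical row starts whenever a chosen key column is non-empty, and lines with an empty key column continue the previous row. The function name `group_stream_lines` and its parameters are hypothetical and not part of tabula-py.

```python
def group_stream_lines(rows, key_col=0, sep=" "):
    """Merge continuation lines into the preceding logical row.

    ASSUMPTION: a new record starts whenever the cell in `key_col` is
    non-empty; rows with an empty key cell continue the previous record.
    All rows are assumed to have the same number of cells.
    """
    grouped = []
    for row in rows:
        # Treat None and NaN (NaN != NaN) as empty cells; strip whitespace.
        cells = ["" if (c is None or c != c) else str(c).strip() for c in row]
        if cells[key_col] or not grouped:
            # Key cell is filled (or this is the first line): start a new record.
            grouped.append(cells)
        else:
            # Continuation line: append each non-empty cell to the previous record.
            prev = grouped[-1]
            for i, cell in enumerate(cells):
                if cell:
                    prev[i] = prev[i] + sep + cell if prev[i] else cell
    return grouped


# Made-up sample data, only to illustrate the merging.
sample = [
    ["1", "Alpha", "first line"],
    ["", "", "second line"],
    ["2", "Beta", "only line"],
]
merged = group_stream_lines(sample)
```

With a DataFrame from stream mode, the input could be built with `df.values.tolist()`, and the merged list passed back to `pd.DataFrame`. Whether the key-column assumption fits depends on the layout of the actual PDF.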

Check list before submit

  • Did you read FAQ?

  • (Optional, but really helpful) Your PDF URL: ?

  • Paste the output of import tabula; tabula.environment_info() on Python REPL: ?

Python version:
3.9.7 (default, Sep 16 2021, 16:59:28) [MSC v.1916 64 bit (AMD64)]
Java version:
java version "1.8.0_321"
Java(TM) SE Runtime Environment (build 1.8.0_321-b07)
Java HotSpot(TM) Client VM (build 25.321-b07, mixed mode)
tabula-py version: 2.3.0
platform: Windows-10-10.0.19044-SP0

What did you do when you faced the problem?

I created a function to parse multiple PDFs into a single file.

Code:

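import pandas as pd
from tabula import read_pdf


def parse_streampdf_pages(pdf, col_dims):
    # Read every page in stream mode, using a fixed area and explicit
    # column boundaries instead of tabula's automatic guessing.
    pages = read_pdf(
        pdf,
        pages="all",
        guess=False,
        stream=True,
        silent=True,
        area=[0, 0, 1100, 1100],
        columns=col_dims,
        pandas_options={"header": None},
    )

    # Combine the per-page tables and record which file they came from.
    temp_df = pd.concat(pages)
    temp_df["filename"] = pdf

    return temp_df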
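The function above uses stream mode, while the summary describes the problem with lattice mode. For comparison, a lattice-mode variant might look like the sketch below. It assumes the same page range, area, and pandas options as the stream version, and it drops the `columns` argument on the assumption that explicit column boundaries are only relevant to stream mode. The names `lattice_options` and `parse_latticepdf_pages` are illustrative and do not come from the original report.

```python
def lattice_options():
    """Keyword arguments for a lattice-mode read_pdf call (illustrative helper)."""
    return {
        "pages": "all",
        "guess": False,
        "lattice": True,  # detect cells from ruling lines instead of whitespace
        "silent": True,
        "area": [0, 0, 1100, 1100],
        "pandas_options": {"header": None},
    }


def parse_latticepdf_pages(pdf):
    # Imported inside the function so the options helper above can be
    # inspected on a machine without tabula-py or Java installed.
    import pandas as pd
    from tabula import read_pdf

    pages = read_pdf(pdf, **lattice_options())

    # Combine the per-page tables and record which file they came from.
    temp_df = pd.concat(pages)
    temp_df["filename"] = pdf
    return temp_df
```

Calling `parse_latticepdf_pages("example.pdf")` would then be expected to reproduce the behavior reported here, where only the grey rows come back.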

Expected behavior:

I expected to get a data frame with 19 rows.

Actual behavior:

The output contains only the rows with a grey background. The rows with a white background are missing.

Related Issues:

@joeanton719 this issue was automatically closed because it did not follow the issue template