chezou/tabula-py

read_pdf_with_template using a multi-page template passed as a binary fails

will-byrne-cardano opened this issue · 3 comments

Summary of your issue

When I call read_pdf_with_template and pass the template as a binary file object, and that template covers more than one page, a ValueError is raised saying the file is empty. This doesn't happen when I pass the file path of the template JSON instead; in that case both pages are read successfully. I have to pass a binary file object because of how my program accesses the data: it runs on a service that downloads both the PDF to read and the template from a cloud file system, so both arrive as binary files. My current workaround is to split the template into two (or more) files, one per page, make repeated calls to read_pdf_with_template for the different pages, and then combine the results. While this is not a big problem, it feels like it should be avoidable.

Check list before submit

  • Did you read FAQ?

  • (Optional, but really helpful) Your PDF URL: ?

  • Paste the output of import tabula; tabula.environment_info() on Python REPL: ?

Python version:
    3.7.9 (tags/v3.7.9:13c94747c7, Aug 17 2020, 18:58:18) [MSC v.1900 64 bit (AMD64)]
Java version:
    java version "1.8.0_321"
Java(TM) SE Runtime Environment (build 1.8.0_321-b07)
Java HotSpot(TM) 64-Bit Server VM (build 25.321-b07, mixed mode)
tabula-py version: 2.3.0
platform: Windows-10-10.0.19041-SP0
uname:
    uname_result(system='Windows', node='XXXXX', release='10', version='10.0.19041', machine='AMD64', processor='Intel64 Family 6 Model 142 Stepping 12, GenuineIntel')
linux_distribution: ('', '', '')
mac_ver: ('', ('', '', ''), '')

What did you do when you faced the problem?

As mentioned above, I worked around the problem in my code by creating one template file per page I wanted to read and passing these sequentially to read_pdf_with_template. The problem itself can be reproduced using the files in the tests/resources dir.

Code:

from tabula import read_pdf_with_template


path = "path/to/tests/resources/data.pdf"
template = "path/to/tests/resources/data.tabula-template.json"

with open(path, 'rb') as pdf:
    with open(template, 'rb') as temp:
        data = read_pdf_with_template(pdf, temp)

Expected behavior:

I expected this to return a list of four dataframes, as happens when the function is called with the absolute paths of the PDF file and the template file. For example, the code below successfully reads the PDF using the template file:

from tabula import read_pdf_with_template


path = "path/to/tests/resources/data.pdf"
template = "path/to/tests/resources/data.tabula-template.json"
data = read_pdf_with_template(path, template)
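Since the path-based call above works, one possible interim workaround for downloaded data is to write each binary object to a temporary file and pass the resulting paths. This is only a sketch: buffer_to_temp_path is a hypothetical helper, not part of tabula-py, and the sample bytes are stand-ins for a real template.

```python
import io
import os
import shutil
import tempfile


def buffer_to_temp_path(buffer, suffix):
    """Copy a binary file-like object to a temp file and return its path.

    Hypothetical helper (not part of tabula-py). The caller is
    responsible for deleting the file afterwards.
    """
    buffer.seek(0)  # rewind in case the buffer has already been read
    with tempfile.NamedTemporaryFile(suffix=suffix, delete=False) as tmp:
        shutil.copyfileobj(buffer, tmp)
    return tmp.name


# Stand-in for a template downloaded from a cloud file system.
sample = io.BytesIO(b'{"stand-in": "template"}')
sample.read()  # simulate a buffer whose pointer is already at the end

template_path = buffer_to_temp_path(sample, ".json")
with open(template_path, "rb") as f:
    copied = f.read()
os.remove(template_path)

# With real data, the paths would then go to the path-based call:
#   pdf_path = buffer_to_temp_path(pdf, ".pdf")
#   template_path = buffer_to_temp_path(temp, ".json")
#   data = read_pdf_with_template(pdf_path, template_path)
```

This avoids splitting the template per page, at the cost of managing the temporary files yourself.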

Actual behavior:

When we call read_pdf_with_template with a binary file object as the template JSON (where the template covers more than one page), we get the error below:

ValueError: C:\Users\XXXX\AppData\Local\Temp\[uuid].pdf is empty. Check the file, or download it manually.

Related Issues:

After playing around a bit more with the above, I think this happens because at this point in the code the PDF is being read again (for the next page of the template), but nothing tells the path_or_buffer representing the PDF file to seek(0). As a result, shutil.copyfileobj copies no data over, and the error is raised because the file in the temp location is empty.

Adding the line marked below seems to fix the problem without breaking other functionality:

    elif is_file_like(path_or_buffer):
        path_or_buffer.seek(0)  #<----- New line
        filename = os.path.join(gettempdir(), "{}{}".format(uuid.uuid4(), suffix))

        with open(filename, "wb") as f:
            shutil.copyfileobj(path_or_buffer, f)
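The behaviour described above can be reproduced without tabula-py or a real PDF. The sketch below uses an in-memory buffer with stand-in bytes; copy_buffer is a hypothetical helper that only mimics the copy step, not the library's actual code.

```python
import io
import shutil


def copy_buffer(src, rewind):
    """Copy src into a fresh in-memory buffer and return the copied bytes."""
    if rewind:
        src.seek(0)  # move the read pointer back to the start
    dst = io.BytesIO()
    shutil.copyfileobj(src, dst)  # reads from the current position onward
    return dst.getvalue()


# Stand-in for the PDF file object; the content is not a real PDF.
pdf = io.BytesIO(b"%PDF-1.4 stand-in bytes")

first = copy_buffer(pdf, rewind=False)   # first read gets the whole content
second = copy_buffer(pdf, rewind=False)  # pointer is at the end, nothing copied
third = copy_buffer(pdf, rewind=True)    # seek(0) makes the content readable again
```

The second copy is empty because the first one left the read pointer at the end of the buffer, which matches the empty temp file in the error message.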

@will-byrne-cardano Thanks for reporting! I fixed it and released 2.4.0

@chezou thanks very much!