chezou/tabula-py

Exception during read_pdf: SEVERE: Cannot read JPEG2000 image: Java Advanced Imaging (JAI) Image I/O Tools are not installed

drjubbs opened this issue · 1 comments

Summary of your issue

tabula-py throws an exception while processing an image-dense PDF

Check list before submit

  • Did you read FAQ?

  • (Optional, but really helpful) Your PDF URL: https://api.environdec.com/api/v1/EPDLibrary/Files/6410b3fe-07f0-4766-b6b0-08da4d16d787/Data

  • Paste the output of import tabula; tabula.environment_info() on Python REPL:
    Python version:
    3.10.4 | packaged by conda-forge | (main, Mar 30 2022, 08:38:02) [MSC v.1916 64 bit (AMD64)]
    Java version:
    java version "18.0.2" 2022-07-19
    Java(TM) SE Runtime Environment (build 18.0.2+9-61)
    Java HotSpot(TM) 64-Bit Server VM (build 18.0.2+9-61, mixed mode, sharing)
    tabula-py version: 2.4.0
    platform: Windows-10-10.0.19044-SP0
    uname:
    uname_result(system='Windows', node='windows-knime', release='10', version='10.0.19044', machine='AMD64')
    linux_distribution: ('', '', '')
    mac_ver: ('', ('', '', ''), '')

What did you do when you faced the problem?

I was able to confirm the library works fine on simple PDFs lacking images.

Code:

tables4 = tabula.read_pdf(r"E:\PDF\celsa_circular_steel_EPD_Special_Steel_Wire.pdf", pages="all")

Expected behavior:

read_pdf should return an array of tables

Actual behavior:

Got stderr: Jul 29, 2022 3:51:02 PM org.apache.pdfbox.contentstream.PDFStreamEngine operatorException
SEVERE: Cannot read JPEG2000 image: Java Advanced Imaging (JAI) Image I/O Tools are not installed
Jul 29, 2022 3:51:02 PM org.apache.pdfbox.contentstream.PDFStreamEngine operatorException
SEVERE: Cannot read JPEG2000 image: Java Advanced Imaging (JAI) Image I/O Tools are not installed
Jul 29, 2022 3:51:02 PM org.apache.pdfbox.contentstream.PDFStreamEngine operatorException
SEVERE: Cannot read JPEG2000 image: Java Advanced Imaging (JAI) Image I/O Tools are not installed
Jul 29, 2022 3:51:04 PM org.apache.pdfbox.contentstream.PDFStreamEngine operatorException
SEVERE: Cannot read JPEG2000 image: Java Advanced Imaging (JAI) Image I/O Tools are not installed
Jul 29, 2022 3:51:04 PM org.apache.pdfbox.contentstream.PDFStreamEngine operatorException
SEVERE: Cannot read JPEG2000 image: Java Advanced Imaging (JAI) Image I/O Tools are not installed
Jul 29, 2022 3:51:05 PM org.apache.pdfbox.contentstream.PDFStreamEngine operatorException
SEVERE: Cannot read JPEG2000 image: Java Advanced Imaging (JAI) Image I/O Tools are not installed
Jul 29, 2022 3:51:05 PM org.apache.pdfbox.contentstream.PDFStreamEngine operatorException

Related Issues:

None

This is not a bug: the message is a warning from PDFBox. See the FAQ: https://tabula-py.readthedocs.io/en/latest/faq.html#i-got-a-warning-error-message-from-pdfbox-including-org-apache-pdfbox-pdmodel-is-it-the-cause-of-empty-dataframe

In this case, guess=True (the default) somehow produces empty results. Try other options such as guess=False or stream=True.

In [1]: import tabula

In [4]: fname = "Data.pdf"

In [5]: tabula.read_pdf(fname, pages=6, guess=False)
Got stderr: Aug 06, 2022 5:35:40 PM org.apache.pdfbox.contentstream.PDFStreamEngine operatorException
SEVERE: Cannot read JPEG2000 image: Java Advanced Imaging (JAI) Image I/O Tools are not installed
Aug 06, 2022 5:35:40 PM org.apache.pdfbox.contentstream.PDFStreamEngine operatorException
SEVERE: Cannot read JPEG2000 image: Java Advanced Imaging (JAI) Image I/O Tools are not installed

Out[5]:
[                       3. Product                                Product description  Unnamed: 0
 0                     information  Global Steel Wire is one of Europe’s leading s...         NaN
 1                             NaN               wire rod, with an extensive range of         NaN
 2                             NaN                  grades and diameters ranging from         NaN
 3                             NaN                   5.5 mm up to 52 mm, manufactured         NaN
 4                             NaN                   in accordance with international         NaN
 5                             NaN               standards and tailored to our custo-         NaN
 6                             NaN                    mers’ technical specifications.         NaN
 7                             NaN      Wire rod is available in low, medium and high         NaN
 8                             NaN       carbon steels (between 0.05% and 1.25%) with         NaN
 9                             NaN  different alloy grades (Al, B, Cr, Mn, Mo, P, Si,         NaN
 10                            NaN        S, among others), depending on its composi-         NaN
 11                            NaN                          tion and characteristics.         NaN
 12                            NaN     In addition, wire rod can be supplied in round         NaN
 13                            NaN    or hexagonal section, in different coil formats         NaN
 14                            NaN        and with a wide variety of heat and surface         NaN
 15                            NaN           treatments. Among its extensive range of         NaN
 16                            NaN         products, Global Steel Wire specialises in         NaN
 17                  Product name:    wire rods for cold heading, tyre reinforcement,         NaN
 18                            NaN          suspension springs, free cutting and cold         NaN
 19      Hot-rolled steel wire rod           drawing, as well as low, medium and high         NaN
 20                            NaN                                     carbon steels.         NaN
 21        Product identification:        Global Steel Wire is present in all sectors         NaN
 22                            NaN         where wire rod based products are manufac-         NaN
 23  Hot-rolled special steel wire          tured, and has become one of the European         NaN
 24   manufactured in electric arc         leaders in sectors with high technological         NaN
 25        furnace based on scrap.      demands, especially in the automotive sector.         NaN
 26                            NaN                                                NaN        10.0]
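
The advice to try other options can be sketched as a small helper that walks through a few option sets and keeps the first one that yields a non-empty table. This is only an illustration, not part of tabula-py: `first_nonempty_tables` and its `reader` parameter are hypothetical names introduced here. When `reader` is omitted, it falls back to `tabula.read_pdf` with the `pages`, `guess`, `stream`, and `lattice` keyword arguments.

```python
def first_nonempty_tables(path, pages, option_sets, reader=None):
    """Try each option set in order; return (options, tables) for the first
    one that produces at least one non-empty table, else (None, []).

    Hypothetical helper, not a tabula-py API.
    """
    if reader is None:
        # Imported lazily so the helper can be exercised without Java/tabula.
        import tabula
        reader = tabula.read_pdf

    for options in option_sets:
        tables = reader(path, pages=pages, **options)
        # len() of a DataFrame is its row count, so this drops empty tables.
        nonempty = [t for t in tables if len(t) > 0]
        if nonempty:
            return options, nonempty
    return None, []


# Example call (assumes Data.pdf is the file from this issue):
# options, tables = first_nonempty_tables(
#     "Data.pdf",
#     pages=6,
#     option_sets=[{"guess": True}, {"guess": False}, {"stream": True}, {"lattice": True}],
# )
```

The PDFBox messages on stderr can still appear with every option set; the helper only checks whether any rows were extracted.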