chezou/tabula-py

tabula-py CalledProcessError: Command '['java', '-Dfile.encoding=UTF8', '-jar',

kdshreyas opened this issue · 3 comments

Summary of your issue

I encountered an issue while processing a PDF file where a specific page consistently triggers a "CalledProcessError" with the following command: ['java', '-Dfile.encoding=UTF8', '-jar']. This error disrupts the processing flow and prevents further execution.

CalledProcessError: Command '['java', '-Dfile.encoding=UTF8', '-jar', 'D:\Anaconda\envs\dev_env\lib\site-packages\tabula\tabula-1.0.5-jar-with-dependencies.jar', '--pages', '1', '--lattice', '--format', 'JSON'

Check list before submit

  • Did you read FAQ?

  • (Optional, but really helpful) Your PDF URL: ?
    test_pdf_output.pdf

  • Paste the output of import tabula; tabula.environment_info() on Python REPL: ?
    Python version:
    3.9.13 (main, Oct 13 2022, 21:23:06) [MSC v.1916 64 bit (AMD64)]
    Java version:
    java version "1.8.0_371"
    Java(TM) SE Runtime Environment (build 1.8.0_371-b11)
    Java HotSpot(TM) 64-Bit Server VM (build 25.371-b11, mixed mode)
    tabula-py version: 2.3.0
    platform: Windows-10-10.0.19045-SP0
    uname:
    uname_result(system='Windows', node='IND-CHN-LT11760', release='10', version='10.0.19045', machine='AMD64')
    linux_distribution: ('MSYS_NT-10.0-19045', '3.1.7', '')
    mac_ver: ('', ('', '', ''), '')

If not possible to execute tabula.environment_info(), please answer following questions manually.

  • Paste the output of python --version command on your terminal: ?
  • Paste the output of java -version command on your terminal: ?
  • Does java -h command work well?; Ensure your java command is included in PATH
  • Write your OS and its version: ?

What did you do when you faced the problem?

Code:

import tabula

inputpdf = 'test_pdf_output.pdf'
page = 1
# Lattice mode with guessing disabled; this call raises CalledProcessError for this page
tables = tabula.read_pdf(inputpdf, pages=page, lattice=True, guess=False)
df = tables[0]

Expected behavior:

The command should execute successfully on the page of the PDF file, without encountering any errors.

Actual behavior:

The error "CalledProcessError" is encountered when processing the specified page within the PDF file.

Error from tabula-java:
Exception in thread "main" java.lang.IllegalArgumentException: lines must be orthogonal, vertical and horizontal
	at technology.tabula.Ruling.intersectionPoint(Ruling.java:214)
	at technology.tabula.Ruling.findIntersections(Ruling.java:378)
	at technology.tabula.extractors.SpreadsheetExtractionAlgorithm.findCells(SpreadsheetExtractionAlgorithm.java:134)
	at technology.tabula.extractors.SpreadsheetExtractionAlgorithm.extract(SpreadsheetExtractionAlgorithm.java:63)
	at technology.tabula.extractors.SpreadsheetExtractionAlgorithm.extract(SpreadsheetExtractionAlgorithm.java:41)
	at technology.tabula.CommandLineApp$TableExtractor.extractTablesSpreadsheet(CommandLineApp.java:452)
	at technology.tabula.CommandLineApp$TableExtractor.extractTables(CommandLineApp.java:410)
	at technology.tabula.CommandLineApp.extractFile(CommandLineApp.java:180)
	at technology.tabula.CommandLineApp.extractFileTables(CommandLineApp.java:124)
	at technology.tabula.CommandLineApp.extractTables(CommandLineApp.java:106)
	at technology.tabula.CommandLineApp.main(CommandLineApp.java:76)


---------------------------------------------------------------------------
CalledProcessError                        Traceback (most recent call last)
Cell In [237], line 3
      1 inputpdf = 'output.pdf'
      2 page = 1
----> 3 tables = tabula.read_pdf(inputpdf, pages = page, lattice = True, guess = False)
      4 df = tables[0]
      5 df

File D:\Anaconda\envs\dev_env\lib\site-packages\tabula\io.py:322, in read_pdf(input_path, output_format, encoding, java_options, pandas_options, multiple_tables, user_agent, **kwargs)
    317     raise ValueError(
    318         "{} is empty. Check the file, or download it manually.".format(path)
    319     )
    321 try:
--> 322     output = _run(java_options, kwargs, path, encoding)
    323 finally:
    324     if temporary:

File D:\Anaconda\envs\dev_env\lib\site-packages\tabula\io.py:80, in _run(java_options, options, path, encoding)
     77     args.append(path)
     79 try:
---> 80     result = subprocess.run(
     81         args,
     82         stdout=subprocess.PIPE,
     83         stderr=subprocess.PIPE,
     84         stdin=subprocess.DEVNULL,
     85         check=True,
     86     )
     87     if result.stderr:
     88         logger.warning("Got stderr: {}".format(result.stderr.decode(encoding)))

File D:\Anaconda\envs\dev_env\lib\subprocess.py:528, in run(input, capture_output, timeout, check, *popenargs, **kwargs)
    526     retcode = process.poll()
    527     if check and retcode:
--> 528         raise CalledProcessError(retcode, process.args,
    529                                  output=stdout, stderr=stderr)
    530 return CompletedProcess(process.args, retcode, stdout, stderr)

CalledProcessError: Command '['java', '-Dfile.encoding=UTF8', '-jar', 'D:\\Anaconda\\envs\\dev_env\\lib\\site-packages\\tabula\\tabula-1.0.5-jar-with-dependencies.jar', '--pages', '1', '--lattice', '--format', 'JSON', 'output.pdf']' returned non-zero exit status 1.

Related Issues:

chezou commented

Thanks for reporting the issue.

It looks like this is a tabula-java issue that happens with this specific PDF. I found a similar issue in their repo:
tabulapdf/tabula-java#218

Would you mind providing the PDF and reporting it on tabula-java?

chezou commented

Okay, I confirmed that the issue happens when tabula-java is run with the --lattice option on this file. It doesn't raise an error without the --lattice option.

$ java  -Dfile.encoding=UTF8 -jar tabula/tabula-1.0.5-jar-with-dependencies.jar --pages 1 --lattice ~/Downloads/test_pdf_output.pdf
Exception in thread "main" java.lang.IllegalArgumentException: lines must be orthogonal, vertical and horizontal
	at technology.tabula.Ruling.intersectionPoint(Ruling.java:214)
	at technology.tabula.Ruling.findIntersections(Ruling.java:378)
	at technology.tabula.extractors.SpreadsheetExtractionAlgorithm.findCells(SpreadsheetExtractionAlgorithm.java:134)
	at technology.tabula.extractors.SpreadsheetExtractionAlgorithm.extract(SpreadsheetExtractionAlgorithm.java:63)
	at technology.tabula.extractors.SpreadsheetExtractionAlgorithm.extract(SpreadsheetExtractionAlgorithm.java:41)
	at technology.tabula.CommandLineApp$TableExtractor.extractTablesSpreadsheet(CommandLineApp.java:452)
	at technology.tabula.CommandLineApp$TableExtractor.extractTables(CommandLineApp.java:410)
	at technology.tabula.CommandLineApp.extractFile(CommandLineApp.java:180)
	at technology.tabula.CommandLineApp.extractFileTables(CommandLineApp.java:124)
	at technology.tabula.CommandLineApp.extractTables(CommandLineApp.java:106)
	at technology.tabula.CommandLineApp.main(CommandLineApp.java:76)
$ java  -Dfile.encoding=UTF8 -jar tabula/tabula-1.0.5-jar-with-dependencies.jar --pages 1  ~/Downloads/test_pdf_output.pdf
"","Utah Medicaid Preferred Drug List - Effective April 1, 2023"
"",Quinolones
"",Last Brand
Preferred Drugs,Status Type Limits Mandatory 3-Month Additional Note
"",Update Required
Cipro suspension,Preferred Brand 02/01/10 Cipro susp
"ciprofloxacin 250, 500, 750mg Preferred",Generic 02/01/10
levofloxacin,Preferred Generic 02/01/16
moxifloxacin,Preferred Generic 01/01/21
"",Last Required Prior Brand
Non Preferred Drugs,Status Type Limits Additional Note
"",Update Authorization Form Required
Baxdela,Non Preferred Brand 10/01/17 Medication Coverage Exception
Cipro tablet,Non Preferred Brand 02/01/10 Medication Coverage Exception
ciprofloxacin 100mg tablet,Non Preferred Generic 01/01/22 Medication Coverage Exception
ciprofloxacin suspension,Non Preferred Generic 01/01/20 Medication Coverage Exception Cipro susp
ofloxacin tablet,Non Preferred Generic 02/01/10 Medication Coverage Exception
"",Tetracyclines
"",Last Brand
Preferred Drugs,Status Type Limits Mandatory 3-Month Additional Note
"",Update Required
doxycycline monohydrate,
"",Preferred Generic 01/01/20
"50, 100mg capsule",
doxycycline hyclate,
"",Preferred Generic 01/01/20
"50, 100mg",
minocycline,
"",Preferred Generic 01/01/20
"50, 75, 100mg capsule",
"",Last Required Prior Brand
Non Preferred Drugs,Status Type Limits Additional Note
"",Update Authorization Form Required
demeclocycline,Non Preferred Generic 01/01/20 Medication Coverage Exception
Doryx,Non Preferred Brand 01/01/20 Medication Coverage Exception
doxycycline (unless listed preferred),Non Preferred Generic 01/01/20 Medication Coverage Exception
Minocin,Non Preferred Brand 01/01/20 Medication Coverage Exception
minocycline ER capsule,Non Preferred Generic 12/01/22 Medication Coverage Exception
minocycline tablet,Non Preferred Generic 01/01/20 Medication Coverage Exception
Minolira,Non Preferred Brand 01/01/20 Medication Coverage Exception
Nuzyra,Non Preferred Brand 01/01/20 Medication Coverage Exception
Solodyn,Non Preferred Brand 01/01/20 Medication Coverage Exception
tetracycline,Non Preferred Generic 01/01/20 Medication Coverage Exception
Vibramycin,Non Preferred Brand 01/01/20 Medication Coverage Exception
Ximino,Non Preferred Brand 01/01/20 Medication Coverage Exception
"",Page 11 of 111

This looks like an issue on the tabula-java side.

Closing, as tabula-py doesn't have any workaround.
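
For anyone who hits the same error before it is fixed upstream, here is a minimal caller-side sketch. It is not a feature of tabula-py. It assumes that tabula.read_pdf re-raises subprocess.CalledProcessError when tabula-java exits with a non-zero status, as the traceback above shows for tabula-py 2.3.0, and it retries the page without lattice mode, as in the second command above. The helper name read_tables_with_fallback and its reader parameter are hypothetical and only for illustration.

```python
import subprocess


def read_tables_with_fallback(reader, path, page):
    """Try lattice mode first; retry without it if tabula-java fails.

    `reader` is expected to behave like tabula.read_pdf (an assumption):
    it accepts pages/lattice/guess keyword arguments and raises
    subprocess.CalledProcessError when the Java process exits non-zero.
    """
    try:
        return reader(path, pages=page, lattice=True, guess=False)
    except subprocess.CalledProcessError:
        # Lattice extraction failed for this page; retry without --lattice.
        return reader(path, pages=page, guess=False)


# Hypothetical usage, assuming tabula-py is installed:
#   import tabula
#   tables = read_tables_with_fallback(tabula.read_pdf, 'test_pdf_output.pdf', 1)
```

As the output above shows, extraction without lattice mode splits the columns differently, so the resulting tables may need extra cleanup compared with what lattice mode would have produced.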

Hey @chezou,

Thanks for the quick reply. I have created an issue, tabulapdf/tabula-java#529, as suggested.