chezou/tabula-py

HTTP Error 400 Bad Request when reading from AWS S3 PreSigned URL

jgcmarins opened this issue · 6 comments

Summary of your issue

I am trying to read a PDF from an AWS S3 presigned URL and I am getting the following error:

Traceback (most recent call last):
  File "main.py", line 3, in <module>
    df = tabula.read_pdf("https://my-bucket-name.s3.amazonaws.com/v1/user/62e970a844d091d90069ab7d/file/62ed00ce74b8a716407975d6?SIGNED_HASH")[0]
  File "/home/alunix/code/ddc-reader-python/ddc-reader/lib/python3.8/site-packages/tabula/io.py", line 311, in read_pdf
    path, temporary = localize_file(input_path, user_agent)
  File "/home/alunix/code/ddc-reader-python/ddc-reader/lib/python3.8/site-packages/tabula/file_util.py", line 48, in localize_file
    req = urlopen(path_or_buffer)
  File "/usr/lib/python3.8/urllib/request.py", line 222, in urlopen
    return opener.open(url, data, timeout)
  File "/usr/lib/python3.8/urllib/request.py", line 531, in open
    response = meth(req, response)
  File "/usr/lib/python3.8/urllib/request.py", line 640, in http_response
    response = self.parent.error(
  File "/usr/lib/python3.8/urllib/request.py", line 569, in error
    return self._call_chain(*args)
  File "/usr/lib/python3.8/urllib/request.py", line 502, in _call_chain
    result = func(*args)
  File "/usr/lib/python3.8/urllib/request.py", line 649, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 400: Bad Request

Check list before submit

  • Did you read FAQ?

  • (Optional, but really helpful) Your PDF URL: it's an AWS S3 presigned URL, so sorry, I can't share it. The presigned URL itself works, though: I can open the PDF in the browser.

  • Paste the output of import tabula; tabula.environment_info() on Python REPL:

Python version:
    3.8.10 (default, Jun 22 2022, 20:18:18)
[GCC 9.4.0]
Java version:
    openjdk version "1.8.0_342"
OpenJDK Runtime Environment (build 1.8.0_342-8u342-b07-0ubuntu1~20.04-b07)
OpenJDK 64-Bit Server VM (build 25.342-b07, mixed mode)
tabula-py version: 2.4.0
platform: Linux-5.10.16.3-microsoft-standard-WSL2-x86_64-with-glibc2.29
uname:
    uname_result(system='Linux', node='DESKTOP-KVLP8PC', release='5.10.16.3-microsoft-standard-WSL2', version='#1 SMP Fri Apr 2 22:23:49 UTC 2021', machine='x86_64', processor='x86_64')
linux_distribution: ('Ubuntu', '20.04', 'focal')
mac_ver: ('', ('', '', ''), '')
None
  • Paste the output of python --version command on your terminal: Python 3.8.10
  • Paste the output of java -version command on your terminal:
openjdk version "1.8.0_342"
OpenJDK Runtime Environment (build 1.8.0_342-8u342-b07-0ubuntu1~20.04-b07)
OpenJDK 64-Bit Server VM (build 25.342-b07, mixed mode)
  • Does java -h command work well?; Ensure your java command is included in PATH
  • Write your OS and its version: Windows 10 with WSL

What did you do when you faced the problem?

Tried to search on both Google and GitHub issues to see if anyone else is facing the same issue and found nothing.

Code:

df = tabula.read_pdf("https://my-bucket-name.s3.amazonaws.com/v1/user/62e970a844d091d90069ab7d/file/62ed00ce74b8a716407975d6?SIGNED_HASH")[0]
df.to_csv('./test.csv', encoding='utf-8')
print(df)

Expected behavior:

CSV with PDF data.

Actual behavior:

Traceback (most recent call last):
  File "main.py", line 3, in <module>
    df = tabula.read_pdf("https://my-bucket-name.s3.amazonaws.com/v1/user/62e970a844d091d90069ab7d/file/62ed00ce74b8a716407975d6?SIGNED_HASH")[0]
  File "/home/alunix/code/ddc-reader-python/ddc-reader/lib/python3.8/site-packages/tabula/io.py", line 311, in read_pdf
    path, temporary = localize_file(input_path, user_agent)
  File "/home/alunix/code/ddc-reader-python/ddc-reader/lib/python3.8/site-packages/tabula/file_util.py", line 48, in localize_file
    req = urlopen(path_or_buffer)
  File "/usr/lib/python3.8/urllib/request.py", line 222, in urlopen
    return opener.open(url, data, timeout)
  File "/usr/lib/python3.8/urllib/request.py", line 531, in open
    response = meth(req, response)
  File "/usr/lib/python3.8/urllib/request.py", line 640, in http_response
    response = self.parent.error(
  File "/usr/lib/python3.8/urllib/request.py", line 569, in error
    return self._call_chain(*args)
  File "/usr/lib/python3.8/urllib/request.py", line 502, in _call_chain
    result = func(*args)
  File "/usr/lib/python3.8/urllib/request.py", line 649, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 400: Bad Request

Related Issues:

I added a use_raw_url option to handle presigned URLs. Can you try it when you have time?

Hey @chezou, thank you for the quick fix, but I am still seeing the same error :(

I've noticed that in the URL you added to the PR example, there's a .pdf suffix in the file name: https://tabula-py-test.s3.ca-central-1.amazonaws.com/data.pdf?SIGNED_HASH

Meanwhile, in mine, there's no .pdf suffix: https://my-bucket-name.s3.amazonaws.com/v1/user/62e970a844d091d90069ab7d/file/62ed00ce74b8a716407975d6?SIGNED_HASH

Also, in yours, there's a response-content-disposition=inline parameter.

Do you think any of those could make any difference?

Code with use_raw_url

import tabula

df = tabula.read_pdf("https://my-bucket-name.s3.amazonaws.com/v1/user/62e970a844d091d90069ab7d/file/62ed00ce74b8a716407975d6?SIGNED_HASH", 
    pages="all", use_raw_url=True)

Stack trace

 File "main.py", line 3, in <module>
    df = tabula.read_pdf("https://my-bucket-name.s3.amazonaws.com/v1/user/62e970a844d091d90069ab7d/file/62ed00ce74b8a716407975d6?SIGNED_HASH",
  File "/home/alunix/code/ddc-reader-python/ddc-reader/lib/python3.8/site-packages/tabula/io.py", line 311, in read_pdf
    path, temporary = localize_file(input_path, user_agent)
  File "/home/alunix/code/ddc-reader-python/ddc-reader/lib/python3.8/site-packages/tabula/file_util.py", line 48, in localize_file
    req = urlopen(path_or_buffer)
  File "/usr/lib/python3.8/urllib/request.py", line 222, in urlopen
    return opener.open(url, data, timeout)
  File "/usr/lib/python3.8/urllib/request.py", line 531, in open
    response = meth(req, response)
  File "/usr/lib/python3.8/urllib/request.py", line 640, in http_response
    response = self.parent.error(
  File "/usr/lib/python3.8/urllib/request.py", line 569, in error
    return self._call_chain(*args)
  File "/usr/lib/python3.8/urllib/request.py", line 502, in _call_chain
    result = func(*args)
  File "/usr/lib/python3.8/urllib/request.py", line 649, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 400: Bad Request
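
One workaround that does not depend on how tabula-py builds the HTTP request is to fetch the PDF yourself and hand tabula.read_pdf a local path. The sketch below is only an illustration: download_to_tempfile and read_presigned_pdf are hypothetical helper names, not part of tabula-py, and it assumes the presigned URL is still valid when the request is made.

```python
import os
import shutil
import tempfile
from urllib.request import urlopen


def download_to_tempfile(url, suffix=".pdf"):
    """Fetch `url` exactly as given and return the path of a local copy."""
    # The URL string is handed to urlopen unchanged, so the signed query
    # string is sent as it was generated.
    with urlopen(url) as response:
        with tempfile.NamedTemporaryFile(suffix=suffix, delete=False) as tmp:
            shutil.copyfileobj(response, tmp)
            return tmp.name


def read_presigned_pdf(url):
    """Download the PDF, parse it with tabula-py, then delete the local copy."""
    import tabula  # imported lazily so the download helper has no tabula dependency

    local_path = download_to_tempfile(url)
    try:
        return tabula.read_pdf(local_path, pages="all")
    finally:
        os.remove(local_path)
```

Calling read_presigned_pdf with the presigned URL should then return the same list of DataFrames that tabula.read_pdf returns for a local file.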

@jgcmarins Just confirming, did you reinstall tabula-py from the latest master branch? I haven't released it to PyPI yet.

I suspect you are not running the latest version of tabula-py: the stack trace shows the error at line 48 of localize_file, but on the latest master branch that call should be at line 59.

req = urlopen(path_or_buffer)

File "/home/alunix/code/ddc-reader-python/ddc-reader/lib/python3.8/site-packages/tabula/file_util.py", line 48, in localize_file
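
A quick way to check which build is actually being imported is to print the installed version and see whether read_pdf declares the new option. This is a minimal sketch: installed_version, accepts_parameter and report are hypothetical helper names, and it assumes use_raw_url is a named parameter of tabula.read_pdf in builds that include the fix.

```python
import inspect
from importlib import metadata


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def accepts_parameter(func, name):
    """Return True if `func` declares a parameter called `name`."""
    # A function that only takes **kwargs may accept the keyword without
    # using it, so look for an explicitly declared parameter.
    return name in inspect.signature(func).parameters


def report():
    """Print what the current environment says about tabula-py."""
    import tabula  # imported lazily so the helpers above work without tabula

    print("tabula-py version:", installed_version("tabula-py"))
    print("loaded from:", tabula.__file__)
    print("read_pdf declares use_raw_url:",
          accepts_parameter(tabula.read_pdf, "use_raw_url"))
```

If report() shows that read_pdf does not declare use_raw_url, the environment is most likely still on a build from before the option was added.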

The suffix should not be the problem, since tabula-py adds it automatically.

Now it is working perfectly, thanks a lot for your help!!

Released 2.5.0 https://pypi.org/project/tabula-py/2.5.0/
Thanks for reporting!

@chezou thank you!!!