HTTP Error 400 Bad Request when reading from AWS S3 PreSigned URL
jgcmarins opened this issue · 6 comments
Summary of your issue
I am trying to read a PDF from an AWS S3 presigned URL and I am experiencing the following error:
Traceback (most recent call last):
File "main.py", line 3, in <module>
df = tabula.read_pdf("https://my-bucket-name.s3.amazonaws.com/v1/user/62e970a844d091d90069ab7d/file/62ed00ce74b8a716407975d6?SIGNED_HASH")[0]
File "/home/alunix/code/ddc-reader-python/ddc-reader/lib/python3.8/site-packages/tabula/io.py", line 311, in read_pdf
path, temporary = localize_file(input_path, user_agent)
File "/home/alunix/code/ddc-reader-python/ddc-reader/lib/python3.8/site-packages/tabula/file_util.py", line 48, in localize_file
req = urlopen(path_or_buffer)
File "/usr/lib/python3.8/urllib/request.py", line 222, in urlopen
return opener.open(url, data, timeout)
File "/usr/lib/python3.8/urllib/request.py", line 531, in open
response = meth(req, response)
File "/usr/lib/python3.8/urllib/request.py", line 640, in http_response
response = self.parent.error(
File "/usr/lib/python3.8/urllib/request.py", line 569, in error
return self._call_chain(*args)
File "/usr/lib/python3.8/urllib/request.py", line 502, in _call_chain
result = func(*args)
File "/usr/lib/python3.8/urllib/request.py", line 649, in http_error_default
raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 400: Bad Request
Check list before submit
- Did you read FAQ?
- (Optional, but really helpful) Your PDF URL: it's an AWS S3 presigned URL, sorry, can't share, but the presigned URL is working because I am able to access the PDF through the browser
- Paste the output of import tabula; tabula.environment_info() on Python REPL:
Python version:
3.8.10 (default, Jun 22 2022, 20:18:18)
[GCC 9.4.0]
Java version:
openjdk version "1.8.0_342"
OpenJDK Runtime Environment (build 1.8.0_342-8u342-b07-0ubuntu1~20.04-b07)
OpenJDK 64-Bit Server VM (build 25.342-b07, mixed mode)
tabula-py version: 2.4.0
platform: Linux-5.10.16.3-microsoft-standard-WSL2-x86_64-with-glibc2.29
uname:
uname_result(system='Linux', node='DESKTOP-KVLP8PC', release='5.10.16.3-microsoft-standard-WSL2', version='#1 SMP Fri Apr 2 22:23:49 UTC 2021', machine='x86_64', processor='x86_64')
linux_distribution: ('Ubuntu', '20.04', 'focal')
mac_ver: ('', ('', '', ''), '')
None
- Paste the output of python --version command on your terminal: Python 3.8.10
- Paste the output of java -version command on your terminal:
openjdk version "1.8.0_342"
OpenJDK Runtime Environment (build 1.8.0_342-8u342-b07-0ubuntu1~20.04-b07)
OpenJDK 64-Bit Server VM (build 25.342-b07, mixed mode)
- Does java -h command work well?; Ensure your java command is included in PATH
- Write your OS and its version: Windows 10 with WSL
What did you do when you faced the problem?
Tried to search on both Google and GitHub issues to see if anyone else is facing the same issue and found nothing.
Code:
import tabula

df = tabula.read_pdf("https://my-bucket-name.s3.amazonaws.com/v1/user/62e970a844d091d90069ab7d/file/62ed00ce74b8a716407975d6?SIGNED_HASH")[0]
df.to_csv('./test.csv', encoding='utf-8')
print(df)
Expected behavior:
CSV with PDF data.
Actual behavior:
The same traceback as in the summary above, ending in urllib.error.HTTPError: HTTP Error 400: Bad Request
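The traceback only shows the status line. When a request like this fails, the response body often says more; S3 in particular returns an XML error document. The following is a minimal sketch for inspecting that body outside tabula-py. The helper names summarize_http_error and fetch_or_explain are hypothetical and are not part of tabula-py or urllib.

```python
import io
import urllib.error
import urllib.request


def summarize_http_error(err: urllib.error.HTTPError, limit: int = 500) -> dict:
    """Collect the status code, reason, and the first `limit` bytes of the body."""
    body = err.read(limit).decode("utf-8", errors="replace")
    return {"code": err.code, "reason": err.reason, "body": body}


def fetch_or_explain(url: str) -> bytes:
    """Fetch `url`; on an HTTP error, print the server's explanation and re-raise."""
    try:
        with urllib.request.urlopen(url) as resp:
            return resp.read()
    except urllib.error.HTTPError as err:
        print(summarize_http_error(err))
        raise
```

Calling fetch_or_explain with the presigned URL would show whether a plain urllib request is rejected as well, and with which error code in the body.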
Related Issues:
Added a use_raw_url option to handle the presigned URL case. Can you try it when you have time?
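The thread does not say why the original request was rejected. One plausible cause, offered here as an assumption, is URL normalization: the tabula-py documentation describes use_raw_url as using the URL without quoting/dequoting, and decoding then re-encoding a signed URL can change percent-encoded characters so that the signature no longer matches. The sketch below is a generic illustration with a made-up signature value; it is not tabula-py's actual code, and requote is a hypothetical helper.

```python
from urllib.parse import quote, unquote


def requote(component: str) -> str:
    """Decode and re-encode a URL component, as a normalizing client might."""
    return quote(unquote(component))


# Made-up, percent-encoded signature value (not from a real presigned URL).
original = "abc%2Bdef%2F123%3D"
normalized = requote(original)

print(original)    # abc%2Bdef%2F123%3D
print(normalized)  # abc%2Bdef/123%3D  ("/" is left unencoded by quote's default)
```

The round trip is not the identity here because quote treats "/" as safe by default, so the encoded slash comes back unencoded.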
Hey @chezou, thank you for the quick fix, but I am still seeing the same error :(
I've noticed that the URL you added to the PR example has a .pdf suffix in the file name: https://tabula-py-test.s3.ca-central-1.amazonaws.com/data.pdf?SIGNED_HASH
Mine has no .pdf suffix: https://my-bucket-name.s3.amazonaws.com/v1/user/62e970a844d091d90069ab7d/file/62ed00ce74b8a716407975d6?SIGNED_HASH
Yours also has a response-content-disposition=inline parameter.
Do you think any of those could make a difference?
Code with use_raw_url
import tabula

df = tabula.read_pdf("https://my-bucket-name.s3.amazonaws.com/v1/user/62e970a844d091d90069ab7d/file/62ed00ce74b8a716407975d6?SIGNED_HASH",
                     pages="all", use_raw_url=True)
Stack trace
File "main.py", line 3, in <module>
df = tabula.read_pdf("https://my-bucket-name.s3.amazonaws.com/v1/user/62e970a844d091d90069ab7d/file/62ed00ce74b8a716407975d6?SIGNED_HASH",
File "/home/alunix/code/ddc-reader-python/ddc-reader/lib/python3.8/site-packages/tabula/io.py", line 311, in read_pdf
path, temporary = localize_file(input_path, user_agent)
File "/home/alunix/code/ddc-reader-python/ddc-reader/lib/python3.8/site-packages/tabula/file_util.py", line 48, in localize_file
req = urlopen(path_or_buffer)
File "/usr/lib/python3.8/urllib/request.py", line 222, in urlopen
return opener.open(url, data, timeout)
File "/usr/lib/python3.8/urllib/request.py", line 531, in open
response = meth(req, response)
File "/usr/lib/python3.8/urllib/request.py", line 640, in http_response
response = self.parent.error(
File "/usr/lib/python3.8/urllib/request.py", line 569, in error
return self._call_chain(*args)
File "/usr/lib/python3.8/urllib/request.py", line 502, in _call_chain
result = func(*args)
File "/usr/lib/python3.8/urllib/request.py", line 649, in http_error_default
raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 400: Bad Request
@jgcmarins Just confirming, did you reinstall tabula-py from the latest master branch? I haven't released it to PyPI yet.
I doubt you are using that version of tabula-py, since your stack trace shows the error at line 48 of localize_file, but with the latest master branch it should be at line 59 (line 59 in 5dac208):
File "/home/alunix/code/ddc-reader-python/ddc-reader/lib/python3.8/site-packages/tabula/file_util.py", line 48, in localize_file
The suffix should not be the problem, since tabula-py adds it automatically.
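A quick way to confirm which build is actually being imported is to check the installed distribution version and whether read_pdf declares the new keyword. This is a minimal sketch; installed_version and accepts_keyword are hypothetical helper names, and it assumes the newer read_pdf lists use_raw_url as an explicit parameter, which this thread does not confirm.

```python
import inspect
from importlib import metadata
from typing import Callable, Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def accepts_keyword(func: Callable, name: str) -> bool:
    """True only if `name` is an explicitly declared parameter of `func`.

    A function that merely swallows extra arguments through **kwargs
    reports False, which is the point of the check.
    """
    return name in inspect.signature(func).parameters


if __name__ == "__main__":
    print("tabula-py version:", installed_version("tabula-py"))
    try:
        import tabula
    except ImportError:
        print("tabula-py is not importable in this environment")
    else:
        print("read_pdf declares use_raw_url:",
              accepts_keyword(tabula.read_pdf, "use_raw_url"))
```

Running this inside the same virtual environment as main.py matters, because the stack trace paths point at that environment's site-packages.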
Now it is working perfectly, thanks a lot for your help!!
Released 2.5.0 https://pypi.org/project/tabula-py/2.5.0/
Thanks for reporting!