Uploading files larger than 2GB does not work
Bug report
1. Describe your environment
- OS: Debian 10 (buster) 64bit
- pyDataverse: 0.3.1
- Python: 3.7.3
- Dataverse: 4.20-dev
2. Actual behaviour:
Trying to upload a file larger than 2GB causes an error. Uploading the same file using curl works fine.
3. Expected behaviour:
The file uploads successfully, or at least pyDataverse fails with a clear error saying the file is too big.
4. Steps to reproduce
The program and stack trace are as follows:
from pyDataverse.api import NativeApi
from pyDataverse.models import Datafile

api = NativeApi(SERVER_URL, API_KEY)

df = Datafile()
ds_pid = ID_OF_EXISTING_DATASET
df_filename = PATH_TO_FILENAME_OF_BIG_FILE
df.set({"pid": ds_pid, "filename": df_filename})
api.upload_datafile(ds_pid, df_filename, df.json())
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python3.7/dist-packages/pyDataverse/api.py", line 1685, in upload_datafile
url, data={"jsonData": json_str}, files=files, auth=True
File "/usr/local/lib/python3.7/dist-packages/pyDataverse/api.py", line 174, in post_request
resp = post(url, data=data, params=params, files=files)
File "/usr/lib/python3/dist-packages/requests/api.py", line 116, in post
return request('post', url, data=data, json=json, **kwargs)
File "/usr/lib/python3/dist-packages/requests/api.py", line 60, in request
return session.request(method=method, url=url, **kwargs)
File "/usr/lib/python3/dist-packages/requests/sessions.py", line 533, in request
resp = self.send(prep, **send_kwargs)
File "/usr/lib/python3/dist-packages/requests/sessions.py", line 646, in send
r = adapter.send(request, **kwargs)
File "/usr/lib/python3/dist-packages/requests/adapters.py", line 449, in send
timeout=timeout
File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 600, in urlopen
chunked=chunked)
File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 354, in _make_request
conn.request(method, url, **httplib_request_kw)
File "/usr/lib/python3.7/http/client.py", line 1260, in request
self._send_request(method, url, body, headers, encode_chunked)
File "/usr/lib/python3.7/http/client.py", line 1306, in _send_request
self.endheaders(body, encode_chunked=encode_chunked)
File "/usr/lib/python3.7/http/client.py", line 1255, in endheaders
self._send_output(message_body, encode_chunked=encode_chunked)
File "/usr/lib/python3.7/http/client.py", line 1069, in _send_output
self.send(chunk)
File "/usr/lib/python3.7/http/client.py", line 991, in send
self.sock.sendall(data)
File "/usr/lib/python3.7/ssl.py", line 1015, in sendall
v = self.send(byte_view[count:])
File "/usr/lib/python3.7/ssl.py", line 984, in send
return self._sslobj.write(data)
OverflowError: string longer than 2147483647 bytes
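For context on the traceback: the number in the OverflowError is 2**31 - 1, the most bytes CPython's ssl module will accept in a single write, and requests builds the entire multipart body in memory before sending it. A minimal sketch of a pre-flight check; the helper name is hypothetical, not part of pyDataverse:

```python
import os

# 2147483647 == 2**31 - 1: CPython's ssl module refuses to write more than
# this many bytes in one call, and requests assembles the whole
# multipart/form-data body in memory, so any upload whose body exceeds this
# size fails with the OverflowError shown above.
SSL_SINGLE_WRITE_LIMIT = 2**31 - 1

def needs_streaming_upload(path: str) -> bool:
    """Hypothetical pre-flight check: True if the file alone already
    exceeds what a single in-memory request body can carry."""
    return os.path.getsize(path) >= SSL_SINGLE_WRITE_LIMIT
```

Such a check could at least turn the opaque OverflowError into an early, understandable error message.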
5. Possible solution
Some possible solutions (streaming upload or chunk-encoded request) are written here:
I am not very versed in Python, but I will try to fix this in the following week and submit a pull request. If I fail, feel free to fix this bug!
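To illustrate the chunk-encoded direction: requests switches to chunked transfer encoding when `data` is a generator, so no single bytes object (and no single SSL write) ever approaches the 2 GB limit. A minimal sketch, with the caveat that the Dataverse endpoint expects multipart/form-data with a jsonData part, so this alone is not a drop-in fix:

```python
def iter_file_chunks(path, chunk_size=8 * 1024 * 1024):
    """Yield a large file in fixed-size chunks so the request body is
    streamed from disk instead of built as one giant bytes object."""
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                return
            yield chunk

# Passing a generator as `data` makes requests use chunked transfer encoding:
#   requests.post(url, data=iter_file_chunks("/path/to/big_file"))
```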
Forgive me if this isn't relevant. When uploading really large files, in my case lidar data, I use an S3 bucket set up for direct upload. That doesn't work with pyDataverse, but for uploading really large files individually a direct-upload bucket is helpful.
I understand that this is not relevant for you. However, if the Dataverse installation in question does not use an S3 storage backend, then this becomes instantly relevant.
The issue is, I am on parental leave right now (until May 2022), and we at AUSSDA do not use S3, so I cannot test this.
The best way to move forward would be for you to resolve the issue yourselves.
We also just ran into this. From looking at the Dataverse side, uploads using multipart/form-data should be available.
For the sending side, looks like "requests-toolbelt" has something we could use: https://toolbelt.readthedocs.io/en/latest/uploading-data.html
Maybe it would be good to detect the file size and either do a normal upload when it is under 2 GB, or a multipart upload for larger files?
(I don't have the capacity right now to look into this.)
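The size-based dispatch suggested above could look something like this; the helper name and the exact cutoff are assumptions for illustration, not anything already in pyDataverse:

```python
import os

# 2 GB cutoff as suggested above; the exact threshold is an open question.
SIMPLE_UPLOAD_MAX = 2 * 1024**3

def pick_upload_strategy(path: str) -> str:
    """Hypothetical dispatcher: 'simple' for files a normal in-memory
    request can handle, 'multipart' for anything larger."""
    if os.path.getsize(path) < SIMPLE_UPLOAD_MAX:
        return "simple"
    return "multipart"
```

upload_datafile could call such a dispatcher internally, keeping the public API unchanged for small files.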
Can this bug be reproduced at https://demo.dataverse.org ? Currently the file upload limit there is 2.5 GB, high enough for a proper test, it would seem.
Also related to #136
Update: I left AUSSDA, so my funding for pyDataverse development has stopped.
I want to get some basic funding to implement the most urgent updates (PRs, Bug fixes, maintenance work). If you can support this, please reach out to me. (www.stefankasberger.at). If you have feature requests, the same.
Another option would be that someone else helps with the development and/or maintenance. For this, also get in touch with me (or comment here).
I know I shall not expect movement here (unless someone else picks it up or we find funding).
But to not let newly found insights slip away, and for what it's worth: how about exchanging requests for aiohttp?
I know aiohttp is a much larger dependency, but it does support multipart uploads. https://docs.aiohttp.org/en/stable/multipart.html
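A rough sketch of what an aiohttp-based upload might look like. aiohttp streams a file object in chunks rather than building one giant body in memory, which sidesteps the 2 GB single-write limit. The endpoint path and field names mirror the native API's add-file call, but this is an untested assumption, not verified pyDataverse code:

```python
import asyncio
import os

import aiohttp

async def upload_large_file(server_url, api_key, ds_pid, path):
    """Hypothetical sketch: stream a file to the native add-file endpoint
    as multipart/form-data via aiohttp."""
    form = aiohttp.FormData()
    # aiohttp reads the file object chunk by chunk while sending.
    form.add_field("file", open(path, "rb"),
                   filename=os.path.basename(path))
    url = f"{server_url}/api/datasets/:persistentId/add?persistentId={ds_pid}"
    headers = {"X-Dataverse-key": api_key}
    async with aiohttp.ClientSession() as session:
        async with session.post(url, data=form, headers=headers) as resp:
            return resp.status, await resp.text()

# Usage (placeholders as in the repro above):
#   asyncio.run(upload_large_file(SERVER_URL, API_KEY, ds_pid, df_filename))
```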
Not sure that helps out-of-the-box since our multipart direct upload involves contacting Dataverse to get signed URLs for the S3 parts, etc. FWIW, I think @landreev implemented our mechanism in python, it just hasn't been integrated with pyDataverse.
@qqmyers you are right - direct upload needs more. Maybe one day we also extend pyDataverse for this.
That said: this issue here is about uploading with a simple HTTP upload via the API. As requests is not capable of streaming multipart uploads, you are limited to a 2 GB file size (the same limitation as our SWORD 2.0 library). The API endpoint itself is capable of handling multipart uploads.