gdcc/pyDataverse

Direct upload to S3 store using the Dataverse direct upload API

Opened this issue · 10 comments

I have been working with the direct upload API (https://guides.dataverse.org/en/5.4/developers/s3-direct-upload-api.html).
It's done in two passes: the first puts the file into temporary S3 storage, the second adds it to the dataset. As soon as I have a workable script I'll send it over.
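For the record, here is a rough sketch of those two passes based on my reading of the linked guide (single-part upload only; `server`, `api_key`, `persistent_id`, and `filepath` are placeholders, and I have not run this end-to-end against a live installation):

```python
import json

def uploadurls_url(server, persistent_id, size):
    """Build the pass-1 URL that asks Dataverse for a signed S3 upload URL."""
    return ("%s/api/datasets/:persistentId/uploadurls?persistentId=%s&size=%s"
            % (server, persistent_id, size))

def direct_upload(server, api_key, persistent_id, filepath, size):
    """Two-pass direct upload sketch: get a signed URL, PUT the bytes to S3,
    then register the uploaded file with the dataset. Untested sketch."""
    import requests  # third-party; kept local so the URL helper stays stdlib-only
    headers = {"X-Dataverse-key": api_key}

    # Pass 1: request a one-time signed S3 URL (the guide's curl example is a GET),
    info = requests.get(uploadurls_url(server, persistent_id, size),
                        headers=headers).json()["data"]

    # ...then PUT the file bytes straight to temporary S3 storage.
    with open(filepath, "rb") as f:
        requests.put(info["url"], data=f,
                     headers={"x-amz-tagging": "dv-state=temp"})

    # Pass 2: attach the file now sitting in temp storage to the dataset.
    add_url = ("%s/api/datasets/:persistentId/add?persistentId=%s"
               % (server, persistent_id))
    json_data = {"storageIdentifier": info["storageIdentifier"],
                 "fileName": filepath.split("/")[-1]}
    return requests.post(add_url, headers=headers,
                         data={"jsonData": json.dumps(json_data)})
```

The multipart variant in the guide returns several part URLs instead of a single `url`, so larger files would need extra handling on top of this.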
I'm a bit confused about the POST request. The documentation shows:

def post_request(self, url, data=None, auth=False, params=None, files=None):
    """Make a POST request."""

But if I set auth=True (because I'm using an API key) I get this error:

TypeError: 'bool' object is not callable

I checked my server log and found this:
#|2021-04-19T19:43:25.360+0000|SEVERE|Payara 5.2020.6|javax.enterprise.web.core|_ThreadID=66;_ThreadName=http-thread-pool::http-listener-1(3);_TimeMillis=1618861405360;_LevelValue=1000;_MessageID=AS-WEB-CORE-00037;|
An exception or error occurred in the container during the request processing
java.lang.Exception: Host is not set
at org.glassfish.grizzly.http.server.util.Mapper.map(Mapper.java:865)
at org.apache.catalina.connector.CoyoteAdapter.postParseRequest(CoyoteAdapter.java:496)
at org.apache.catalina.connector.CoyoteAdapter.doService(CoyoteAdapter.java:309)
at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:238)
at com.sun.enterprise.v3.services.impl.ContainerMapper$HttpHandlerCallable.call(ContainerMapper.java:520)
at com.sun.enterprise.v3.services.impl.ContainerMapper.service(ContainerMapper.java:217)
at org.glassfish.grizzly.http.server.HttpHandler.runService(HttpHandler.java:182)
at org.glassfish.grizzly.http.server.HttpHandler.doHandle(HttpHandler.java:156)
at org.glassfish.grizzly.http.server.HttpServerFilter.handleRead(HttpServerFilter.java:218)
at org.glassfish.grizzly.filterchain.ExecutorResolver$9.execute(ExecutorResolver.java:95)
at org.glassfish.grizzly.filterchain.DefaultFilterChain.executeFilter(DefaultFilterChain.java:260)
at org.glassfish.grizzly.filterchain.DefaultFilterChain.executeChainPart(DefaultFilterChain.java:177)
at org.glassfish.grizzly.filterchain.DefaultFilterChain.execute(DefaultFilterChain.java:109)
at org.glassfish.grizzly.filterchain.DefaultFilterChain.process(DefaultFilterChain.java:88)
at org.glassfish.grizzly.ProcessorExecutor.execute(ProcessorExecutor.java:53)
at org.glassfish.grizzly.nio.transport.TCPNIOTransport.fireIOEvent(TCPNIOTransport.java:524)
at org.glassfish.grizzly.strategies.AbstractIOStrategy.fireIOEvent(AbstractIOStrategy.java:89)
at org.glassfish.grizzly.strategies.WorkerThreadIOStrategy.run0(WorkerThreadIOStrategy.java:94)
at org.glassfish.grizzly.strategies.WorkerThreadIOStrategy.access$100(WorkerThreadIOStrategy.java:33)
at org.glassfish.grizzly.strategies.WorkerThreadIOStrategy$WorkerThreadRunnable.run(WorkerThreadIOStrategy.java:114)
at org.glassfish.grizzly.threadpool.AbstractThreadPool$Worker.doWork(AbstractThreadPool.java:569)
at org.glassfish.grizzly.threadpool.AbstractThreadPool$Worker.run(AbstractThreadPool.java:549)
at java.lang.Thread.run(Thread.java:748)
|#]

Jamie Jamison
UCLA Dataverse
jamison@library.ucla.edu

@jmjamison Which pyDataverse and Dataverse versions are you working with? And can you also share the code executed for the POST request?

Dataverse: 5.3 build 286-fcb5ce7
pyDataverse: 0.3.1

import requests
from pyDataverse.api import NativeApi

# dataverse_server and api_key were set earlier
api = NativeApi(dataverse_server, api_key)
resp = api.get_info_version()
resp.json()

{'status': 'OK', 'data': {'version': '5.3', 'build': '286-fcb5ce7'}}

resp = requests.put(url_persistent_id)
resp.json()

{'status': 'ERROR',
'code': 405,
'message': 'API endpoint does not support this method. Consult our API guide at http://guides.dataverse.org.',
'requestUrl': 'https://dataverse.ucla.edu/api/v1/datasets/:persistentId/uploadurls?persistentId=doi:10.25346/S6/T4LHZF&size=10000000',
'requestMethod': 'PUT'}

Also tried:
url_persistent_id = "%s/api/datasets/:persistentId/uploadurls?persistentId=%s&size=%s" % (
    dataverse_server, persistentId, str(size))
r = requests.post(
    url_persistent_id,
    headers={"X-Dataverse-key": api_key},  # was the literal string "$API_TOKEN" from the curl example
)

{'status': 'ERROR',
'code': 405,
'message': 'API endpoint does not support this method. Consult our API guide at http://guides.dataverse.org.',
'requestUrl': 'https://dataverse.ucla.edu/api/v1/datasets/:persistentId/uploadurls?persistentId=doi:10.25346/S6/T4LHZF&size=10000000',
'requestMethod': 'POST'}
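One thing that may explain both 405s: the guide's curl example for this endpoint uses no `-X` flag, i.e. a plain GET, while the attempts above use PUT and POST. A minimal sketch of the GET variant (untested here; `dataverse_server`, `api_key`, `persistentId`, and `size` are the same placeholders as in the snippets above):

```python
def uploadurls_endpoint(server, persistent_id, size):
    """Build the /uploadurls URL, same shape as in the attempts above."""
    return ("%s/api/datasets/:persistentId/uploadurls?persistentId=%s&size=%s"
            % (server, persistent_id, size))

def request_upload_urls(server, api_key, persistent_id, size):
    """Fetch the signed upload URL(s) with GET rather than PUT/POST."""
    import requests  # third-party; local import keeps the URL helper stdlib-only
    return requests.get(uploadurls_endpoint(server, persistent_id, size),
                        headers={"X-Dataverse-key": api_key})
```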

Is there anything else I should add?

@jmjamison Is this still an issue / problem? I am on parental leave until May 2022, so my time for pyDataverse is very, very limited.

Apologies, I didn't realize you were on parental leave. The issue exists but I can use other methods for direct uploads. Enjoy the time with your youngster.

Update: I left AUSSDA, so my funding for pyDataverse development has stopped.

I want to get some basic funding to implement the most urgent updates (PRs, bug fixes, maintenance work). If you can support this, please reach out to me (www.stefankasberger.at). The same applies if you have feature requests.

Another option would be for someone else to help with the development and/or maintenance. For that, please also get in touch with me (or comment here).

FWIW: There was some recent work on Python support for direct upload in IQSS/dataverse.harvard.edu#194 - not multipart yet and not associated with pyDataverse, but possibly useful and possibly something to mine for pyDataverse.

As discussed during the 2024-02-14 meeting of the pyDataverse working group, we are closing old milestones in favor of a new project board at https://github.com/orgs/gdcc/projects/1 and removing issues (like this one) from those old milestones. Please feel free to join the working group! You can find us at https://py.gdcc.io and https://dataverse.zulipchat.com/#narrow/stream/377090-python