gdcc/pyDataverse

Add file-type to file upload


Hello,
When uploading a datafile, the file type is not recognised (it defaults to text/plain). Even if the contentType field is set manually using the set command, the contentType reverts to the default.

Yes, I can confirm that this is the case - there is currently no way to pass the mime type to upload_datafile(); AND all files uploaded via pyDataverse end up with the mime type "text/plain" (not "type unknown" or "application/octet-stream", but "text/plain" specifically!).
I believe I know why this happens; see the issue IQSS/dataverse#8344 in the main Dataverse project.
In short: inside the upload_datafile() method, when the multi-part POST form is created, NO content type is specified for the upload. This apparently fools Dataverse into defaulting to "text/plain", without attempting to use its normal type detection methods. This defaulting behavior can and should be addressed on the Dataverse side. But it would be a good idea to fix it on the pyDataverse side as well: a) provide a way to supply the mime type explicitly; and b) make it default to the standard application/octet-stream - a polite way to say "type unknown" - like curl does, which then prompts Dataverse to at least attempt to identify the file more accurately.
I will make a PR shortly for your consideration.
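
For readers following along, here is a rough sketch of the two points in the comment above: accept an explicit mime type, and fall back to application/octet-stream when none is given. It is not the code from the PR. `build_files_payload` and `DEFAULT_MIME_TYPE` are hypothetical names, and the form field name `file` is an assumption. The only library behavior relied on is that the `files=` argument of requests accepts a `(filename, content, content_type)` tuple per field.

```python
import os

# "Type unknown"; the fallback the comment above suggests.
DEFAULT_MIME_TYPE = "application/octet-stream"


def build_files_payload(path, mime_type=None):
    """Build a dict for the `files=` argument of requests.post().

    Hypothetical helper for illustration; not part of pyDataverse.
    """
    with open(path, "rb") as fh:
        content = fh.read()
    # Each form field maps to (filename, content, content_type).
    # Always sending a content type avoids an untyped multipart part.
    return {
        "file": (os.path.basename(path), content, mime_type or DEFAULT_MIME_TYPE)
    }
```

The result could then be passed along the lines of `requests.post(url, files=build_files_payload("data.csv", "text/csv"), ...)`, where `url` and the remaining arguments depend on your installation.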

Fwiw, the solution proposed does not work with older versions of Dataverse (in our case 5.3). The solution we found at Odum was to add the mime type explicitly to the files.

If someone needs to support this with an older install, the work is here https://github.com/OdumInstitute/pyDataverse/tree/mime_type_upload . Note that to use this functionality you'll have to install a package in your project to get the mime type for your file. We use python-magic (and the underlying libmagic library).
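
As a rough illustration of the mime type lookup mentioned above (not the code from the linked branch): python-magic exposes `magic.from_file(path, mime=True)`. The helper name `detect_mime_type` is hypothetical, and the fallback to the standard library's `mimetypes` module when python-magic is unavailable is an assumption added here so the sketch runs anywhere.

```python
import mimetypes

try:
    import magic  # python-magic; requires the libmagic system library
except ImportError:
    magic = None


def detect_mime_type(path):
    """Return a mime type string for the file at `path`.

    Hypothetical helper for illustration; not part of pyDataverse.
    """
    # Preferred: inspect the file contents via libmagic.
    if magic is not None and hasattr(magic, "from_file"):
        return magic.from_file(path, mime=True)
    # Assumed fallback: guess from the file extension only.
    guessed, _encoding = mimetypes.guess_type(path)
    return guessed or "application/octet-stream"
```

Note that the two paths can disagree: libmagic looks at the bytes, while `mimetypes` only looks at the file name.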

I decided not to create a PR for this because my understanding of pyDataverse is that it doesn't try to support the intricacies of older Dataverse versions. But if this work is something that the community wants I can create an issue and a PR for it.

@matthew-a-dunlap pyDataverse tries to help everyone with any kind of Dataverse version, so it would be really nice to have your solution merged. The problem is, I am not funded anymore, so there is no one maintaining this repo right now. And it would need some proper testing and reviewing before it can be merged (and then a release later on to merge it to master).

As discussed during the 2024-02-14 meeting of the pyDataverse working group, we are closing old milestones in favor of a new project board at https://github.com/orgs/gdcc/projects/1 and removing issues (like this one) from those old milestones. Please feel free to join the working group! You can find us at https://py.gdcc.io and https://dataverse.zulipchat.com/#narrow/stream/377090-python

(please note that I made a quick/trivial PR addressing this issue 2 years ago - #142; I don't know/haven't checked if it's still relevant)

@lincolnsherpa hi! Nice seeing you in Braga last June. Great talk.

As @JR-1991 and I discussed (recording), we're pretty sure this has been fixed in the default (master) branch thanks to a switch from requests to httpx in #174. Are you interested in re-testing? Thanks!