zooniverse/panoptes-python-client

Correct mime types for json subjects

srallen opened this issue · 6 comments

The TESS project should be using json files as one of their subject locations, with a json file extension and a mime type of application/json. Currently libmagic is not correctly detecting these file types and staging subjects have been uploaded as txt files with a mime type of text/plain.

We can set workflow configs to load a particular subject viewer, but I would still like to validate the expected JSON structure so we don't attempt to render a subject whose data is malformed. Text file subjects typically belong to transcription projects and are rendered as plain text; they are not data that should be plotted.

We'll need to think about how best to implement this. Presumably we'll need to check the filenames for a .json extension.

I think we have three options:

  1. Add a list of known file extensions/mime types. A lot of people seem to be having trouble installing libmagic, so maybe it would be best to only use it as a fallback if the file extension is unknown.
  2. Specifically add an exception for JSON. i.e. if the type is text/plain, check if the filename ends in .json.
  3. Add a way to manually override the mime type.

What do people think?
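Options 2 and 3 could be sketched roughly as below. This is only an illustration: resolve_mime_type, its parameters and the override argument are hypothetical names, not part of the client's current API.

```python
def resolve_mime_type(filename, detected_type, override=None):
    """Return the MIME type to use for an uploaded subject location.

    Hypothetical helper: `detected_type` stands for whatever libmagic
    reported, and `override` is a caller-supplied MIME type (option 3).
    """
    # Option 3: an explicit override always wins.
    if override is not None:
        return override
    # Option 2: special-case JSON that was detected as plain text.
    if detected_type == 'text/plain' and filename.lower().endswith('.json'):
        return 'application/json'
    # Otherwise trust the detected type.
    return detected_type
```

With this sketch, a file named subject-1234.json that libmagic reports as text/plain would be treated as application/json, while a .txt file would stay text/plain.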

The problem files for the TESS project have a .txt extension, so we should try this with .json and see if that extension causes problems. I think it's correct behaviour to have text/plain when the extension is .txt.

I think option 1 is the better option, then falling back to libmagic if it's installed. It looks like the mimetypes package could handle this? https://docs.python.org/3/library/mimetypes.html#module-mimetypes
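A minimal sketch of that approach, assuming a small table of pinned extensions, then the standard library's mimetypes module, then libmagic only if it can be imported. KNOWN_TYPES and guess_mime_type are hypothetical names, and the libmagic call assumes the python-magic package.

```python
import mimetypes
import os

# Hypothetical table of extensions we want to pin explicitly.
KNOWN_TYPES = {
    '.json': 'application/json',
    '.svg': 'image/svg+xml',
    '.txt': 'text/plain',
}


def guess_mime_type(path, default='application/octet-stream'):
    """Guess a MIME type from the file extension, using libmagic last."""
    ext = os.path.splitext(path)[1].lower()
    if ext in KNOWN_TYPES:
        return KNOWN_TYPES[ext]

    # The standard library maps many common extensions without libmagic.
    guessed, _encoding = mimetypes.guess_type(path)
    if guessed is not None:
        return guessed

    # Fall back to libmagic only when the extension is unknown.
    try:
        import magic  # assumption: the python-magic package
    except ImportError:
        return default
    if os.path.isfile(path):
        return magic.from_file(path, mime=True)
    return default
```

Because the extension is checked first, a .json or .svg file never reaches libmagic, so it cannot be reported as text/plain, and users who cannot install libmagic still get a sensible type for common extensions.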

I’ve run into this again today for SLSN. My workaround is to explicitly set the MIME type and file contents (apologies for my terrible Python):

# Workaround: declare the MIME type explicitly, then attach the raw file contents.
subject.locations.append('application/json')
with open('data/subject-1234.json', 'rb') as json_data:
    subject._media_files.append(json_data.read())

The same problem also occurs with .svg files: they're converted to .txt.

The Python CLI uses subject.add_location to add file names from a manifest to an upload, which also runs into this bug when libmagic generates the wrong type.