IIIF/discovery

Support for Non IIIF resources

glenrobson opened this issue · 11 comments

This use case came up from a discussion in the IIIF training course last week.

I am a content aggregator and want to switch away from OAI-PMH to IIIF discovery to harvest digital resources. As well as IIIF Images, Audio and Video, I also harvest PDFs and Word documents. If I am to move away from OAI-PMH, I need the discovery solution to encompass both IIIF and non-IIIF resources.

I believe it's possible to reference activities for non-IIIF resources, but how far do we go to support this use case? Do we include non-IIIF examples in the Spec, or is this a recipe?

It's a IIIF Specification about making IIIF resources discoverable. I feel pretty strongly that the spec should not go into those details, which will be many, varied and sometimes unsolvable. PMH also does not support PDFs or Word documents, or indeed anything other than metadata records in XML, so the described use case is somewhat dubious.

A recipe, note or other document describing a non-normative way to handle non-IIIF resources would be fine... but for 1.0 we need to demonstrate implementations, so tying 1.0 to non-IIIF resources would be decidedly problematic.

> A recipe, note or other document describing a non-normative way to do non IIIF resources would be fine... but for 1.0 we need to demonstrate implementations, so tying 1.0 to non IIIF resources would be decidedly problematic.

For 1.0 it would be good to have an agreed example of what this would look like, either in a published recipe (probably unlikely given the timescale of 1.0) or just a recipe issue so I can point to an example when persuading people to make the switch.

> PMH also does not support PDFs or word documents, or indeed anything other than metadata records in XML, so the described use case is somewhat dubious.

It's not unheard of for XML metadata records to point to PDF and Word documents :-).

I am not sure I understand "just a recipe issue so I can point to an example when persuading people to make the switch."
Are you saying that there are organizations interested in the Change API, but they would like to apply it to non-IIIF resources and the solution would be to persuade them to change to IIIF?
I'm not sure such general advocacy should be in scope for this TSG, especially if we have to persuade organizations who host Word documents :-)

> > PMH also does not support PDFs or word documents, or indeed anything other than metadata records in XML, so the described use case is somewhat dubious.
>
> It's not unheard of for XML metadata records to point to PDF and Word documents :-).

I agree that OAI-PMH records may point to resources rather than XML metadata but you get into various complications with what dates apply to what. See, for example, http://www.dlib.org/dlib/december04/vandesompel/12vandesompel.html for a discussion of resource harvesting using OAI-PMH. OAI-PMH is outdated (see e.g. https://www.slideshare.net/simeonwarner/mind-the-gap-77336241) and Activity Streams properly solves the problem of which dates and types apply to which objects and documents through the indirection of Activity objects and, as necessary, Link objects.

Since IIIF Discovery is a profile of Activity Streams, it seems that the straightforward answer to "how do I expose non-IIIF resources in a compatible way?" is "use Activity Streams". The current specification allows for mixed or parallel streams: it says that object types SHOULD be Manifest or Collection, so they could also be type="Document" or whatever else follows the patterns given in Activity Streams, such as Example 111.
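
To make the mixed-stream idea concrete, here is a minimal sketch of what one page of such a stream might look like, built as a Python dict and serialized to JSON. It is not taken from the specification: all URLs are placeholders, the activity types and timestamps are invented for illustration, and the `@context` is omitted on purpose. The only point is that a `Manifest` activity and a `Document` activity can sit side by side.

```python
import json

# Hypothetical page of a mixed stream. Every URL and date below is a
# placeholder chosen for this sketch, not a real endpoint or real data.
page = {
    "type": "OrderedCollectionPage",
    "orderedItems": [
        {
            "type": "Update",
            "object": {
                "id": "https://example.org/iiif/1/manifest",
                "type": "Manifest",  # IIIF resource
            },
            "endTime": "2021-01-06T00:00:00Z",
        },
        {
            "type": "Create",
            "object": {
                "id": "https://example.org/docs/report.pdf",
                "type": "Document",  # non-IIIF resource, Activity Streams type
            },
            "endTime": "2021-01-07T00:00:00Z",
        },
    ],
}

print(json.dumps(page, indent=2))
```

An aggregator interested only in IIIF resources would skip the second activity; one that also harvests documents could process both.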

I agree with Rob's #82 (comment) that a note or recipe explaining this would be useful (for data providers to see what to do, and for aggregators to know what to ignore if they are looking only for IIIF resources). I also agree that it shouldn't be in the specification because that would be a significant expansion of scope and would dilute the key message.

> It's not unheard of for XML metadata records to point to PDF and Word documents :-)

Indeed (and see Simeon's response to why that's not a great idea) but it's also not unheard of for IIIF Manifests to refer to PDFs and other content documents ... so they should just implement IIIF, thereby solving their problem in the same way as they use PMH, and without us having to describe how to implement Discovery for non-IIIF resources.

Recipe proposal from 2021-01-06 call:

  • Make a recipe issue that outlines the solution with an example
  • Make a recipe for it
  • Question as to how to find the recipe?

Spec:

  • Note in the specification that there might be non-IIIF content in a stream, or IIIF resources that the client isn't concerned with
  • Update crawling algorithm to check the type of the object of the activity
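
The type check in the second point could be sketched roughly as follows. This is only an illustration of the idea, not the wording that went into the specification: the function name `iiif_activities`, the `IIIF_TYPES` set and the sample data are all made up for this example.

```python
# Object types this hypothetical client cares about.
IIIF_TYPES = {"Manifest", "Collection"}


def iiif_activities(activities, wanted_types=IIIF_TYPES):
    """Yield only the activities whose object has one of the wanted types."""
    for activity in activities:
        obj = activity.get("object")
        # The object may be missing or given as a bare URI string; its type
        # cannot be checked without dereferencing, so it is skipped here.
        if not isinstance(obj, dict):
            continue
        types = obj.get("type", [])
        if isinstance(types, str):
            types = [types]  # normalise a single type to a list
        if set(wanted_types).intersection(types):
            yield activity


# Placeholder activities: one IIIF Manifest and one PDF.
sample = [
    {"type": "Update",
     "object": {"id": "https://example.org/iiif/1/manifest", "type": "Manifest"}},
    {"type": "Create",
     "object": {"id": "https://example.org/docs/report.pdf", "type": "Text"}},
]

kept = list(iiif_activities(sample))
print([a["object"]["id"] for a in kept])
```

Passing a different `wanted_types` set lets an aggregator that also harvests documents keep the non-IIIF activities instead of ignoring them.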

Registry:

  • Should be primarily IIIF resources to be valid for the registry

Related to #12

Here is an example containing a link to a PDF document and a link to a IIIF Image (without a manifest) for discussion:

https://glenrobson.github.io/iiif_stuff/activities/non-iiif.json

Questions this raises:

  • What is the type for a PDF?
  • Is a IIIF image a dctypes:StillImage?

PDF MIME type is application/pdf?

Call of 2021-02-03: Considered done for the spec. Close when merged. Move recipe issue to cookbook repo

Made the following changes to the example (which will be added to the new recipe issue):

For PDF example:

  • Changed type to Text
  • Added mediaType:"application/pdf"
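
As a rough sketch of the PDF changes listed above, the object of the activity might end up looking like this. The activity type, the URL and the placement of `mediaType` on the object are assumptions made for illustration; the linked example file remains the reference.

```python
import json

# Hypothetical activity for a PDF. Only "type": "Text" and
# "mediaType": "application/pdf" come from the list above; the rest is
# placeholder data.
pdf_activity = {
    "type": "Create",
    "object": {
        "id": "https://example.org/docs/report.pdf",
        "type": "Text",
        "mediaType": "application/pdf",
    },
}

print(json.dumps(pdf_activity, indent=2))
```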

For IIIF Image example: