Make it possible to import IIIF collections
Abbe98 opened this issue · 13 comments
IIIF and the IIIF Presentation API are used by many GLAM institutions, and the ability to import records from IIIF Collections would greatly benefit reusers who wish to clean GLAM data, as well as users of the Commons extension.
Proposed solution
Given the collection root URL, an importer would traverse its content and fetch data from the various IIIF manifests in it.
Additional context
Thanks @Abbe98, this sounds like a really interesting use case. I can definitely see how we might have some specific tools/functions to support IIIF, but I'm less sure what these might look like in reality.
Working from the start, could you say a bit more about how you'd see this working? For example, the example collection you posted contains a set of further collections, which contain a mixture of more collections and manifests. What would the resulting OpenRefine project look like? What might be a typical data cleaning task within the resulting project?
I would intuitively keep this issue in the CommonsExtension repository unless there are things on OpenRefine's side that need to be changed for such an importer to be implemented there.
> For example, the example collection you posted contains a set of further collections, which contain a mixture of more collections and manifests.
@ostephens in my opinion, it would only fetch the data in manifests, and each manifest would become a record, since one manifest generally represents a single media file.
> I would intuitively keep this issue in the CommonsExtension repository unless there are things on OpenRefine's side that need to be changed for such an importer to be implemented there.
Yeah, this probably shouldn't be in core. Not sure the CommonsExtension is the right place either considering that the features aren't dependent on each other.
> @ostephens in my opinion, it would only fetch the data in manifests, and each manifest would become a record, since one manifest generally represents a single media file.
Checking my understanding, if the user were to specify the root URL "https://lbiiif.riksarkivet.se/collection/kartor-och-ritningar" the importer would be required to retrieve the content of the "items" array found at that URL and then:
- If the Item has "type"=="Manifest" store the information in a row in a project
- If the Item has "type"=="Collection", use the URL in the Collection ID property as the root URL and keep going
Through this process the importer would work through all Collections and Manifests that are discoverable from the original root URL and eventually end up with a project that contains all the Manifests that were found?
Have I understood the intention correctly?
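To make the steps above concrete, here is a minimal Python sketch of that traversal. It is only an illustration, not a proposed implementation: the names `fetch_json` and `iter_manifests` are hypothetical, and it assumes every collection exposes an "items" array whose entries carry "id" and "type", as in the example collection.

```python
import json
import urllib.request
from typing import Callable, Iterator


def fetch_json(url: str) -> dict:
    """Hypothetical helper: download a URL and parse the body as JSON."""
    with urllib.request.urlopen(url) as response:
        return json.load(response)


def iter_manifests(root_url: str,
                   fetch: Callable[[str], dict] = fetch_json) -> Iterator[dict]:
    """Yield every item of type "Manifest" reachable from root_url."""
    seen = set()          # collection URLs already visited (guards against cycles)
    stack = [root_url]    # collections still to be fetched
    while stack:
        url = stack.pop()
        if url in seen:
            continue
        seen.add(url)
        collection = fetch(url)
        for item in collection.get("items", []):
            if item.get("type") == "Manifest":
                yield item                # would become one row in the project
            elif item.get("type") == "Collection":
                stack.append(item["id"])  # treat the ID as a new root URL
```

The `fetch` parameter is only there so the traversal can be tried against an in-memory dictionary instead of a live server.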
> Through this process the importer would work through all Collections and Manifests that are discoverable from the original root URL and eventually end up with a project that contains all the Manifests that were found?
> Have I understood the intention correctly?
@ostephens yes. I guess one might want to implement some optional limits (max x number of levels, max y number of records, etc.).
Thanks @Abbe98. I'm not an IIIF expert, but I think it's allowed for collections to include items that are from anywhere online? So we could end up doing some extremely large-scale crawling here? (This could also be limited in some way, of course, such as allowing the user to specify a domain as well as a number of levels.)
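As a rough sketch of what such limits could look like, the Python below crawls breadth-first with three optional bounds. Everything here is an assumption for illustration: the function name `crawl_limited`, the parameter names, the default values, and the choice to match on the exact hostname are all hypothetical.

```python
from collections import deque
from typing import Callable, Iterator, Optional
from urllib.parse import urlparse


def crawl_limited(root_url: str,
                  fetch: Callable[[str], dict],
                  max_depth: int = 3,
                  max_records: int = 1000,
                  allowed_host: Optional[str] = None) -> Iterator[dict]:
    """Yield manifests reachable from root_url, subject to optional limits.

    max_depth    -- levels of nested collections followed below the root
    max_records  -- stop after this many manifests have been yielded
    allowed_host -- if set, only follow collections served from this host
    """
    if max_records <= 0:
        return
    queue = deque([(root_url, 0)])
    seen = {root_url}
    found = 0
    while queue:
        url, depth = queue.popleft()
        for item in fetch(url).get("items", []):
            kind = item.get("type")
            if kind == "Manifest":
                yield item
                found += 1
                if found >= max_records:
                    return
            elif kind == "Collection" and depth < max_depth:
                child = item["id"]
                host_ok = (allowed_host is None
                           or urlparse(child).hostname == allowed_host)
                if host_ok and child not in seen:
                    seen.add(child)
                    queue.append((child, depth + 1))
```

Breadth-first order means that when the record limit is hit, the manifests kept are the ones closest to the root, which seems like the least surprising behaviour for a user.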
@Abbe98 in the case of finding a manifest, how would you want the information in the manifest stored in an OpenRefine row? To take an example from your root we have the collection ID https://lbiiif.riksarkivet.se/collection/arkiv/pZdxhTy01Y7BRBFEIaUwL4
which contains the manifest:
{
  "id": "https://lbiiif.riksarkivet.se/arkis!R0002353/manifest",
  "type": "Manifest",
  "label": {
    "sv": [
      "1:1 [Det långa parlamentets bortdrivande av Cromwell 1653 20/4. Samtida illustration (på engelska) och en tillhörande holländsk text.]"
    ]
  }
}
What would the row/record stored in OpenRefine look like in this case?
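To make the question concrete, one naive mapping would be a column per top-level property, with one column per label language. The Python sketch below is purely illustrative: `manifest_to_row`, the `label.<lang>` column naming, and the " | " separator are all assumptions, not anything OpenRefine does today.

```python
def manifest_to_row(manifest: dict) -> dict:
    """Flatten a manifest summary into a single flat row (column -> value)."""
    row = {
        "id": manifest.get("id"),
        "type": manifest.get("type"),
    }
    # "label" maps a language code to a list of strings; join multiple values
    # so that each language ends up in exactly one cell.
    for lang, values in manifest.get("label", {}).items():
        row[f"label.{lang}"] = " | ".join(values)
    return row
```

Applied to the manifest above, this would give three columns: `id`, `type`, and `label.sv`.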
I think it would be great to discuss with the (very active) IIIF community how they'd like this to be built, and maintained over the longer term.
I agree with @wetneb that this should be moved to a more appropriate repository.
The example collection manifest looks like JSON-LD, so it's already supported by OpenRefine, but with the limitations inherent in mapping tree-shaped (JSON & XML) formats to a rectangular grid.
The universe of JSON applications is obviously way too big to be building specific support into OpenRefine for each of them.
So I have transferred it to the CommonsExtension repo, where it does indeed seem to duplicate #19. Not sure which one people want to keep?
I'm not sure if this is the right place after all. It may very well be that the IIIF community would mainly prefer to use IIIF integration in OpenRefine for generic data cleaning (not for Wikimedia Commons import)! IMO they are the ones to say/decide.
I would strongly suggest a bit of user research, asking potential users about their primary predicted use cases.
My intent when I created this issue had nothing to do with Wikimedia Commons. I too agree that it should be in a separate extension (I believe half of the core extensions should be moved out of core...), but I thought high-level extension requests lived in core's issue tracker.
I have created a wiki page to list some extension requests and listed IIIF there:
https://github.com/OpenRefine/OpenRefine/wiki/Extension-ideas#iiif-import
Pages on the wiki aren't super visible, so it's not clear to me that's the best place to put it. The main issue tracker is also an option, but not ideal either, since these extensions are not meant to be implemented in that repository. Maybe it could also be on the openrefine.org website, but then it's probably harder to edit?