IIIF/discovery

Descriptive properties for ActivityStream data

mixterj opened this issue · 12 comments

A list of AS properties that could help aggregators/harvesters determine whether they are interested in crawling the AS.

Please note that all of the stringy properties are meant for harvesters/aggregators to ingest and build indexes around and are NOT intended for the IIIF Registry to index and make searchable.

Important (based on discussed issues)

  • attributedTo – organization that the AS is associated with
  • name – human readable label for AS (maybe Collection name or Organization name – not sure or very opinionated). Basically I want a string to index for searching (after an aggregator has parsed the AS).
  • summary – human readable text description for the AS. Again, just a bag of words to index for potential searching (after an aggregator has parsed the AS).
  • tag – list of ‘keywords’ or ‘subjects’ for the AS – connected to Controlled Vocabularies ideally since these are Objects that ‘require’ URIs.
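
For illustration, a minimal sketch of how these four properties might sit on the top-level AS collection. The property names (attributedTo, name, summary, tag) are Activity Streams 2.0 terms; every URI and string value below is hypothetical, and the helper function is not part of any specification.

```python
import json


def describe_stream(stream_id, organization, name, summary, tags):
    """Build an AS OrderedCollection dict carrying the descriptive properties.

    All arguments are illustrative; `tags` is a list of (uri, label) pairs,
    ideally pointing at controlled-vocabulary terms.
    """
    return {
        "@context": "https://www.w3.org/ns/activitystreams",
        "id": stream_id,
        "type": "OrderedCollection",
        "attributedTo": {"type": "Organization", "name": organization},
        "name": name,
        "summary": summary,
        # Each tag is an object with a URI, so it can be tied to a vocabulary.
        "tag": [{"id": uri, "name": label} for uri, label in tags],
    }


# Hypothetical example values.
stream = describe_stream(
    stream_id="https://example.org/activity/all-changes",
    organization="Example Library",
    name="Example Library photograph collection",
    summary="Changes to manifests for digitized black and white photographs.",
    tags=[("https://example.org/vocab/photographs", "Photographs")],
)

print(json.dumps(stream, indent=2))
```

An aggregator would only need to read the top-level document to index name, summary and tag, without walking the pages of activities.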

Potential (not discussed but might be useful)

  • startTime – date the AS was first published (maybe use published?)
  • updated – date when the AS was last updated
  • generator – Thing that generated the AS – such as CONTENTdm – maybe more of interest if connected to a specific Activity in the AS??
  • audience – people interested in the AS
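
As a rough sketch of why a property like updated could matter to a harvester: it would allow a crawl to be skipped when nothing has changed since the last visit. The function names, the sample stream and the "crawl when in doubt" policy are all assumptions made for illustration, not agreed behaviour.

```python
from datetime import datetime, timezone


def parse_time(value):
    """Parse an ISO 8601 timestamp, accepting a trailing 'Z' for UTC."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def needs_crawl(stream, last_crawl):
    """Return True if the stream reports an update after `last_crawl`.

    If the stream carries no `updated` value, crawl anyway (assumed policy).
    """
    updated = stream.get("updated")
    if updated is None:
        return True
    return parse_time(updated) > last_crawl


# Hypothetical stream description using the properties listed above.
stream = {
    "type": "OrderedCollection",
    "startTime": "2018-01-15T00:00:00Z",
    "updated": "2018-09-01T12:00:00Z",
    "generator": {"type": "Application", "name": "CONTENTdm"},
}

last_crawl = datetime(2018, 6, 1, tzinfo=timezone.utc)
print(needs_crawl(stream, last_crawl))  # True
```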

AS even redefined their own property for 'name'. How interoperable is that?

The intent of using AS terms here is to maximize the likelihood that generic producers and consumers can be used to produce and consume the resulting documents. By creating our own profile that imports other ontologies into the AS document, we're making certain that only IIIF organizations will be interoperable. This is exacerbated by the ongoing DXWG work to produce a new version of DCAT -- we would be working with a moving target, or instantly out of date.

Secondly, I think we need to be careful to distinguish between the various resources in play:

  • The activities that create, update and delete IIIF resources
  • The IIIF resources themselves
  • The information from which the IIIF resources are generated
  • The real world objects that the information is about, that the IIIF resources somehow represent.

What is useful for discovery is the third bullet above -- the machine readable information, which is at least two steps removed from the ActivityStream.

And in terms of scope, coming up with a profile of DCAT in JSON-LD that is congruous with AS and IIIF design patterns seems like a significant expansion of the charter into an area of limited value and not insignificant complexity.

Proposals:

  • I could definitely see a seeAlso from the AS Collection to other dataset level metadata such as DCAT or VOID. We could recommend that.
  • Use the list provided by Jeff as the starting point and see how far we get. If there's implementer feedback that something more detailed is important to be specified by IIIF, then we can take that on when we need to rather than front loading it at the expense of other tasks.
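
A minimal sketch of the first proposal, under stated assumptions: seeAlso is not an Activity Streams term, so it would have to be defined by an additional context; the shape of each entry (id, format, label) is borrowed from how the Presentation API uses seeAlso, and the URLs are hypothetical. The helper shows how a consumer might normalize the links before fetching a DCAT or VoID description.

```python
def dataset_descriptions(collection):
    """Return the seeAlso entries of an AS collection as a list of dicts.

    Accepts a single string, a single object, or a list of either,
    since a producer might serialize seeAlso in any of those forms.
    """
    see_also = collection.get("seeAlso", [])
    if isinstance(see_also, (str, dict)):
        see_also = [see_also]
    links = []
    for entry in see_also:
        if isinstance(entry, str):
            links.append({"id": entry})  # bare URI, no format or label known
        else:
            links.append(entry)
    return links


# Hypothetical AS collection pointing at dataset-level metadata.
collection = {
    "id": "https://example.org/activity/all-changes",
    "type": "OrderedCollection",
    "seeAlso": [
        {
            "id": "https://example.org/dataset/dcat.jsonld",
            "format": "application/ld+json",
            "label": "DCAT description of the dataset",
        },
        "https://example.org/dataset/void.ttl",
    ],
}

for link in dataset_descriptions(collection):
    print(link["id"], link.get("format", "unknown format"))
```

Keeping the descriptive metadata behind a link like this leaves the AS document itself consumable by generic AS tooling.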

@mixterj I have tried looking at your metadata fields (other than the contentious attributedTo) from the perspective of whether they would apply at the level of the AS itself or of the datasets that the AS "publishes".

I think 'name', 'summary', 'tag' and 'audience' belong rather at the dataset level. Even when we want to describe the situation whereby an organization publishes an AS for someone else's dataset, I don't see this organization adding information for such fields on top of what would be "inherited" from the dataset behind the AS. I mean, if we would like to have a 'summary' strictly at the AS level, to me it would be something like "This Activity Stream publishes the resources from dataset X as well as updates about them". And I don't see much value in that.

The fields 'startTime', 'updated' and 'generator' seem to be much more at the level of the AS itself, in contrast.

@aisaac In principle I agree, but I would suggest that the AS publisher could actually create an AS that serves as a kind of Dataset in and of itself. For example, publishing a subset of items in a CONTENTdm collection (a single Dataset) that are Black & White photos or Manuscripts. In this situation, the AS publisher would in essence be creating their own unique view of the data and could apply unique subjects, names, etc. that may or may not be represented in the data publisher's dataset metadata, or at the very least would be more specific or curated for a given audience.

I certainly will not argue that name, summary, tag, and audience need to belong at the AS level, but I would argue that an AS publisher can and, based on some of the use cases we have talked about, would want to create a curated stream of data and apply unique or more specific metadata properties to it. In this case maybe the AS publisher just needs to create their own Dataset description?

I am also slightly concerned about the consistency of IIIF data publishers in publishing/maintaining dataset descriptions, and about the quality of those descriptions. If Aggregators may be more motivated to describe, in some distinguishing detail, the data they aggregate (collections of things), I do not see why they would not be encouraged to do so in a prescriptive way. As it stands now, the Aggregator needs to harvest every Manifest, hope there is metadata associated with it, parse the metadata, hope there is a IIIF Collection (of which I have not seen any in the datasets I have looked at), and then also maybe look for a Dataset description (VOID or otherwise). It just seems too inconsistent and haphazard to be really functional.

Finally, I think we are really just debating semantics - i.e. where and how would an Aggregator or AS Publisher describe the data encompassed in the AS.

@aisaac yes, I agree with these points. If this is an agreed-upon approach, though, I would push hard for a consistent way to hook all of these components together so the aggregator is not left guessing/hoping - I want to be an aggregator not an archeologist ;)

It sounds like the main components we have here are Manifests, Collection Manifests, ActivityStreams, and Datasets. Does that sound about right?

I guess my main concern is that we have a lightweight, consistent, and easily implementable solution. I will also admit that I am a firm believer that 'the perfect is the enemy of the good' ;)

If we're not providing guidance on use of the other AS terms, then we can see if and how people do add them. If there's a need and some emerging best practice, we can clarify in the future.

Propose that we can close the issue, as we can refer to a dataset description with context.

As discussed in the call on 19-09-2018, the group does not see an objection to following the currently proposed approach, far from it :-). But considering that solutions are not so clearly laid out in individual tickets (at least via the fact that they have an impact on several tickets, here for example #34, #35 and #38), it's preferable to wait and see what the solution looks like in the spec, and then assess how happy we are with the proposed pattern addressing the original case in this ticket.

Call of 2019-03-20: Agree to close, fixed. We get this with our own context, and can use the same pattern of extension and reuse as the Presentation API and the Annotation model. Implementors can use whatever features they like without affecting the processing mode.