HydraCG/Specifications

Requiring proper ranges on supported properties

tpluscode opened this issue · 10 comments

Description

Here is a comment from HydraCG/api-examples#4 which unexpectedly reveals a point of contention regarding supported properties:

But, the RDFS ranges are IMO the minimum requirement for a supported property

I'm keen to agree with you, but having more details explicitly available is something an API may not provide.

Without them the client has only an option to fill the operation request with plain literals.

Untrue - knowing a property IRI, a client could e.g. de-reference it hoping to receive more details on that property, including its range/domain declarations. It's linked data after all

Originally posted by @alien-mcl in HydraCG/api-examples#4 (comment)


TL;DR;

I, rather strongly, propose that property ranges are made explicit in supported properties, as presented below:

{
  "@type": "SupportedProperty",
  "property": {
    "@id": "schema:duration",
    "rdfs:range": "xsd:duration" // <- WE NEED THIS
  }
}

The bottom line is that by choosing anything different we're opening a Pandora's box of misinterpretation, slow clients and confusion.
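To illustrate what an explicit range buys a client, here is a minimal sketch, assuming the supported property arrives as a plain compacted JSON-LD object like the one above. The `WIDGETS` table and the widget names are hypothetical and not part of Hydra; the only point is that the client can decide without any extra request.

```python
# Hypothetical mapping from XSD datatypes to form widgets (not defined by Hydra).
WIDGETS = {
    "xsd:duration": "duration-picker",
    "xsd:date": "date-picker",
    "xsd:integer": "number-input",
    "xsd:boolean": "checkbox",
}


def widget_for(supported_property: dict) -> str:
    """Pick a widget from the explicit rdfs:range, else fall back to plain text."""
    prop = supported_property.get("property", {})
    range_ = prop.get("rdfs:range")
    # The range may be a compact IRI string or an object of the form {"@id": ...}.
    if isinstance(range_, dict):
        range_ = range_.get("@id")
    if not isinstance(range_, str):
        return "text-input"  # no usable range: plain literal input
    return WIDGETS.get(range_, "text-input")


supported = {
    "@type": "SupportedProperty",
    "property": {"@id": "schema:duration", "rdfs:range": "xsd:duration"},
}
print(widget_for(supported))
```

Without the range, the same function can only return the plain text fallback.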


I'd like to address the two replies above as I find them damaging to the practicality of Hydra:

I'm keen to agree with you, but having more details explicitly available is something an API may not provide.

I do not understand this a single bit. The API is the precise, best place for having this information. We're talking about local schemas. Forget vocabularies/ontologies for a moment.

Any API being developed, unless it is a sink for (un|semi)-structured data, would likely operate on a fairly well-defined model. Think Java/.NET DTO classes in particular, with their strongly typed properties. The same applies to any information system anyway. The server rarely expects a free-for-all where any client can post any value it pleases.

Of course, a client may not understand the semantics of a particular datatype, but in any case it is in both the client's and the server's best interest to always provide at least that information.

Untrue - knowing a property IRI, a client could e.g. de-reference it hoping to receive more details on that property, including its range/domain declarations. It's linked data after all

This is just plain wrong, and on more than one level.

First, this overburdens the client. You could have a multitude of properties and types from various vocabularies used in an API. Would you really have the client dereference them all? This is precisely the verbosity of REST APIs that drove people away and into GraphQL.

We should strive for self-containedness wherever possible, especially in an element so fundamental as attribute DATA TYPES.

Second, the ranges defined in an ontology may not be what the API really wants in its resources. As long as they are not breaking semantic rules, that is fine. You will say that it can always be overridden and you will be correct. But why not make it straightforward by simply requiring this information to be there? Any other choice makes the client logic complex.

Lastly, some (quite a few, actually) vocabularies don't even dereference or are poorly maintained. Some are served with the wrong media type. Some are on slow servers. This would be a death trap for any application.

The Semantic Web community has been having this problem for years. That is why projects like LOV exist. Heck, my company has even started work on similar tooling to close this very gap!

Thanks for your detailed explanation, I now understand better why you find it important.

From my point of view the rdfs:range does not really matter, because I have clients in mind that are "domain specific": they are not hard-coded to a specific API, but they are hard-coded to a specific domain, i.e. a set of vocabularies / terms they can handle.

This means those clients would already know how to handle schema:duration, without additional information in either the API doc or the vocabulary itself.

It appears to me that data types would mostly be required by truly generic clients that have no clue about the API domain at all. But even those could fall back to a simple text field if nothing is defined. Of course it would be better to have an explicit range, but I am not sure this justifies enforcing it. The more complex it is to design a Hydra API, the less likely it becomes that people adopt it.

Okay, I see your point but any kind of hard-coding introduces coupling.

This means those clients would already know how to handle schema:duration, without additional information in either the API doc or the vocabulary itself.

This is the definition of "out-of-band" information, which exists outside of the hypermedia controls. This comes at the cost of limited evolvability for the clients.

Of course, as @alien-mcl points out, it is theoretically possible to dereference the vocabulary and have the client do the heavy lifting, but it's hardly an option for the reasons I stated above.

It appears to me, that data types would mostly be required by real generic clients that have no clue about the API domain at all.

Not necessarily. I mentioned API evolution. The more information the API provides explicitly (or that is explicitly part of the media type's processing rules), the more resilient the clients will be to possible changes to the API. In other words, it should be harder to introduce breaking changes.

And such possible change could be switching one vocabulary for another, which should be transparent to the clients in most cases.

The more complex it is to design a Hydra API the less likely it becomes that people adopt it.

Funny you should say that, because I think that the logic here is the exact opposite. See, requiring the range actually simplifies the overall Hydra experience IMO.

On one hand you add a property and call it a day.
The alternative is either to create coupling through hardcoded behaviours or introduce a whole lot more complexity to the clients (dereferencing and/or maintaining information about vocabularies).

And bear in mind that typically there is an N:1 relation between clients and a server. Multiple clients are affected by a change on the server. If it's possible to make the server a bit more complex or strict to avoid that, it should be the default choice.

Without them the client has only an option to fill the operation request with plain literals.

Untrue - knowing a property IRI, a client could e.g. de-reference it hoping to receive more details on that property, including its range/domain declarations. It's linked data after all

This is just plain wrong and not only on one level.

I just stated that the client has another option, without deliberating about feasibility.

Okay, I see your point but any kind of hard-coding introduces coupling.

Generally, I don't expect an API to provide me with an exact description of a property from these vocabularies: schema.org, RDF, RDFS, OWL, FOAF (maybe I could find a few others) - these are either so popular or so widely enforced that I regard them as standard vocabs a good client should understand. The property IRI should be enough - this is where API Documentation and a general pre-processing stage should come in. We can argue about which vocabs should be treated that way, but there are exceptional vocabs that should be understood by the client, and I do not consider that coupling.

In the case of locally minted vocabs - I agree that the API should be more exact, but again - throwing in a property IRI that links to the locally exposed vocab is soooo easy to implement that I'd expect the client to do some smart things.

You could have a multitude of properties and types from various vocabularies used in an API

It may have a performance impact, but doing it in a smart way (in the pre-processing stage mentioned above, at the very beginning of the client's existence) is tempting. Remember that several vocabs use the hash URL approach for their terms - this saves a lot of additional requests.
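A rough way to see the effect of hash URLs is to count how many distinct documents a client would have to request for a set of terms. This is only a sketch: it strips fragments with the standard library and ignores redirects, so slash-based vocabularies that actually serve every term from one document will look worse here than they are in practice.

```python
from urllib.parse import urldefrag


def documents_to_fetch(term_iris):
    """Return the distinct documents a client would request for the given terms.

    Terms that differ only in their fragment share one document.
    """
    return sorted({urldefrag(iri).url for iri in term_iris})


terms = [
    "http://www.w3.org/2000/01/rdf-schema#label",
    "http://www.w3.org/2000/01/rdf-schema#comment",
    "http://xmlns.com/foaf/0.1/knows",
    "http://xmlns.com/foaf/0.1/name",
]

for document in documents_to_fetch(terms):
    print(document)
```

Here the two RDFS terms collapse into a single request, while the two slash-based FOAF terms are counted separately.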

I wouldn't throw out this possibility, but I agree - the more of the description provided by the API the better.

Generally, I don't expect an API to provide me with an exact description of a property from these vocabularies: schema.org, RDF, RDFS, OWL, FOAF (maybe I could find a few others)

Sounds like a fair compromise. But I would explicitly recommend built-in support and warn against the pitfalls of handling this at runtime as the default strategy.

Unfortunately there are still some problems.
First, some vocabs like schema.org may not provide any ranges other than in textual descriptions. Who would maintain this information for the clients? Would we have to handcraft it and include it as part of the Hydra recommendation?

Having read your comments, I think I would propose something similar yet reversed.

  1. For shared vocabularies the server SHOULD provide explicit property ranges.
  2. For local vocabularies, the server MAY omit property ranges but the vocabulary MUST accurately describe them. In such a case, the client MUST dereference the terms which are not explicitly annotated with rdfs:range.
  3. It is NOT RECOMMENDED to skip ranges for external vocabularies unless the client is explicitly implemented to handle them (local caches, etc.) without incurring a performance penalty and/or issues with dereferencing in the first place.

This way we explicitly recommend against relying on dereferencing third party vocabs, which will be an instant problem for any unsuspecting "client" application.
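Read from the client's side, the three points could be sketched roughly as below. Everything here is an assumption made for illustration: the function name, the `is_local` and `dereference` callbacks, the `https://api.example/vocab#` namespace, and the idea of a pre-populated cache for external vocabularies. The `dereference` stub stands in for a real HTTP lookup.

```python
from typing import Callable, Optional


def resolve_range(
    supported_property: dict,
    is_local: Callable[[str], bool],
    dereference: Callable[[str], Optional[str]],
    cache: Optional[dict] = None,
) -> Optional[str]:
    """Resolve a property's range: explicit first, then local lookup, then cache."""
    prop = supported_property["property"]
    explicit = prop.get("rdfs:range")
    if explicit is not None:
        return explicit  # point 1: the server stated the range
    iri = prop["@id"]
    if is_local(iri):
        return dereference(iri)  # point 2: local terms are dereferenced
    # point 3: external terms are only looked up in a local cache, never fetched here
    return (cache or {}).get(iri)


LOCAL_NS = "https://api.example/vocab#"  # hypothetical local vocabulary
local_vocab = {LOCAL_NS + "estimate": "xsd:duration"}
cached = {"http://xmlns.com/foaf/0.1/knows": "http://xmlns.com/foaf/0.1/Person"}


def is_local(iri: str) -> bool:
    return iri.startswith(LOCAL_NS)


def dereference(iri: str) -> Optional[str]:
    return local_vocab.get(iri)  # stand-in for an HTTP GET on the local vocab


external = {"property": {"@id": "http://xmlns.com/foaf/0.1/knows"}}
print(resolve_range(external, is_local, dereference, cached))
```

A client that gets `None` back has nothing to go on and would have to fall back to a plain literal input.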

I think those three points capture my viewpoint well on this, @tpluscode. I agree with you. I am, however, also skeptical about giving too much precise information about the properties to clients, because that encourages strict validation, leading to brittle integrations – sort of like XSD, WSDL and SOAP all over again.

Don't these problems with WSDL etc stem from client code generation? Aren't we safe as long as the client always uses the latest descriptions from Api Doc dynamically?

We are safer, at least. I'm not sure the problem of brittle clients is something we can mitigate 100% no matter how hard we try, but I agree that with dynamic loading of the Api Doc, we are in a much better place than is possible with WSDL.

For me SHOULD in point 1 is a bit too strong - I'd go with MAY. Why?

schema.org has some issues. There is an RDF serialization available directly from the schema.org domain, but it defines ranges with its own terms (schema:rangeIncludes) and not with rdfs:range. Having SHOULD there may end up with people crafting custom ranges which may actually be wrong.

As for WSDL - API Documentation is just a helper that gives clients some hints at the pre-computation stage. The spec mentions that a client should always take the actual responses into account at runtime, as these may not provide data exactly as the API documentation states. The API is supposed to provide hypermedia after all. In a SOAP environment, WSDL was the only option.
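To make the difference concrete, here is a small sketch that reads both predicates from a term description but keeps them apart, since schema:rangeIncludes lists expected types rather than asserting a strict rdfs:range. The function name, the result keys and the sample term objects are illustrative assumptions in compacted JSON-LD form, not copies of the published vocabularies.

```python
def declared_ranges(term: dict) -> dict:
    """Collect range hints from a term description without conflating the predicates."""

    def ids(value):
        # Normalise a missing value, a single value, or a list into a list of IRIs.
        if value is None:
            return []
        items = value if isinstance(value, list) else [value]
        return [item["@id"] if isinstance(item, dict) else item for item in items]

    return {
        "strict": ids(term.get("rdfs:range")),
        "expected": ids(term.get("schema:rangeIncludes")),
    }


# Illustrative, possibly incomplete term descriptions.
schema_duration = {
    "@id": "schema:duration",
    "schema:rangeIncludes": {"@id": "schema:Duration"},
}
foaf_knows = {"@id": "foaf:knows", "rdfs:range": {"@id": "foaf:Person"}}

print(declared_ranges(schema_duration))
print(declared_ranges(foaf_knows))
```

A server that copies the "expected" list into rdfs:range is making its own claim, which is exactly where a handcrafted range could go wrong.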

Okay, so I'm revisiting this currently and trying to actually implement a form element around a Hydra supported operation.

Having SHOULD there may end up with people crafting custom ranges which may actually be wrong.

Actually, crafting custom ranges may be a very common thing for a Hydra API to do. Consider foaf:knows, which has a range of foaf:Person. This range may be too broad for the client (and server). In fact, a server may never explicitly use foaf:Person but insist on foaf:knows. It may even use different ranges for this property in different contexts (i.e. multiple classes which effectively subclass foaf:Person, even if only implicitly).

That said, I have come to accept that MAY could be a better choice and we may want to encourage some kind of lookup from a trusted source.
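As a sketch of the "different ranges in different contexts" case, assume an API documentation where two classes both support foaf:knows but narrow it differently. The `ex:Employee` and `ex:Student` classes are made up, and the sketch treats each nested property description as scoped to its supported class, which is an assumption of this example rather than something RDF guarantees once the graph is merged.

```python
# Hypothetical API documentation fragment in compacted JSON-LD form.
api_doc = {
    "supportedClass": [
        {
            "@id": "ex:Employee",
            "supportedProperty": [
                {"property": {"@id": "foaf:knows", "rdfs:range": "ex:Employee"}}
            ],
        },
        {
            "@id": "ex:Student",
            "supportedProperty": [
                {"property": {"@id": "foaf:knows", "rdfs:range": "ex:Student"}}
            ],
        },
    ]
}


def range_in_context(doc: dict, class_id: str, property_id: str):
    """Return the range a given class declares for a property, or None."""
    for cls in doc.get("supportedClass", []):
        if cls.get("@id") != class_id:
            continue
        for supported in cls.get("supportedProperty", []):
            prop = supported.get("property", {})
            if prop.get("@id") == property_id:
                return prop.get("rdfs:range")
    return None


print(range_in_context(api_doc, "ex:Employee", "foaf:knows"))
print(range_in_context(api_doc, "ex:Student", "foaf:knows"))
```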

Why?

Because I realised one additional hurdle: object vs datatype properties. Would you agree that it's important for the client to get precise information on whether a property's expected value is a resource or a literal? That is additional information which is best looked up in the source and, unlike the ranges, it makes no sense at all to duplicate it for shared vocabularies. Especially since there is more than one way this information could be conveyed.
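For instance, a client could look for an OWL property type first and only then guess from the range. This is a minimal sketch under those assumptions: the function name and the `xsd:` prefix heuristic are mine, the sample terms are illustrative, and a vocabulary may well convey the distinction in some other way that this does not cover.

```python
def value_kind(term: dict) -> str:
    """Guess whether a property expects a resource or a literal."""
    types = term.get("@type", [])
    if isinstance(types, str):
        types = [types]
    if "owl:ObjectProperty" in types:
        return "resource"
    if "owl:DatatypeProperty" in types:
        return "literal"
    # Fallback heuristic (an assumption): an XSD datatype as range implies a literal.
    range_ = term.get("rdfs:range")
    if isinstance(range_, dict):
        range_ = range_.get("@id")
    if isinstance(range_, str) and range_.startswith("xsd:"):
        return "literal"
    return "unknown"


# Illustrative term descriptions in compacted JSON-LD form.
print(value_kind({"@id": "foaf:knows", "@type": "owl:ObjectProperty"}))
print(value_kind({"@id": "ex:estimate", "rdfs:range": "xsd:duration"}))
print(value_kind({"@id": "ex:other"}))
```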

PS

Still a bit worried about the size of schema.org though.

Let me close this. I have reconsidered and come to the conclusion that a more Linked Data approach could in fact be more beneficial in the long run.