Section 7.2
AlexRamageScot1 opened this issue · 31 comments
Comments from the UK:
Suggested change for section 7.2:
Second: we appreciate the usefulness of 7.2.2, “Download of the whole data set”, but are not sure that “rel = enclosure” easily communicates this. We recognize that it is adopted from the Atom specifications & IANA link relation register – but we’re not sure how many users (developers?) find that ‘mainstream’.
Alex Ramage
Background information: bulk download was discussed in opengeospatial/ogcapi-features#230 and there "enclosure" was suggested.
@AlexRamageScot1 - What would you consider to be the "mainstream" link relation type for a downloadable distribution of a dataset? Could you provide good examples where that link relation type is used? We use enclosure in the OGC API Features examples as this is the only applicable registered relation type that we have identified and I couldn't find another link relation type to reference a bulk download file that is widely used. If we missed something, we should also update the examples in the standard.
If my memory serves me right, the previous download services documents allowed for either service-based download (WFS 2.0) or bulk download (ATOM), but did not require implementing both.
Having both options in a single API is an improvement, but I would suggest making the enclosure link optional, while "7.2.2. INSPIRE-specific requirement" seems to make it mandatory.
Case in point: the dataset might be large and already stored in a database. Implementing an enclosure link would then require preparing a full dataset dump in one or more formats, increasing space utilization. A possible alternative would be to build the result on the fly from the database, which would be feasible only with streaming-oriented formats (e.g. GeoJSON, GML), but would not work in a synchronous request for formats that need to be fully written before being sent back to the client (e.g. GeoPackage, shapefile).
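To illustrate why a streaming-oriented format can be built on the fly, here is a minimal sketch that emits a GeoJSON FeatureCollection piece by piece. The generator `fake_rows` is a hypothetical stand-in for a database cursor; nothing here is taken from an actual server implementation.

```python
import json
from typing import Iterable, Iterator


def stream_feature_collection(features: Iterable[dict]) -> Iterator[str]:
    """Yield a GeoJSON FeatureCollection as a sequence of text chunks.

    Only one feature is serialised at a time, so the response can start
    before the last row has been read from the data store.
    """
    yield '{"type": "FeatureCollection", "features": ['
    for index, feature in enumerate(features):
        if index > 0:
            yield ","
        yield json.dumps(feature)
    yield "]}"


def fake_rows() -> Iterator[dict]:
    # Hypothetical stand-in for rows read from a database cursor.
    for i in range(3):
        yield {
            "type": "Feature",
            "id": i,
            "geometry": {"type": "Point", "coordinates": [10.0 + i, 56.0]},
            "properties": {"name": f"building-{i}"},
        }


# A web framework would send each chunk as it is produced;
# here the chunks are simply joined to show the result is valid JSON.
body = "".join(stream_feature_collection(fake_rows()))
```

A container format such as GeoPackage cannot be produced this way, because the file has to be complete before the first byte can be sent.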
7.2.2 We raise similar concerns regarding implementation and anticipated performance challenges. Agree with suggestion of making the enclosure link optional.
If my memory serves me right, the previous download services documents allowed for either service-based download (WFS 2.0) or bulk download (ATOM), but did not require implementing both.
The INSPIRE implementing rules require that a download service has an operation "Get Spatial Data Set", that returns the whole data set (see section 3). This corresponds to e.g.
- best practice 17, Provide bulk download: "Enable consumers to retrieve the full dataset with a single request."
- the definition of an open work (in the context of INSPIRE: replace "work" by "dataset"): Open Definition 2.1: "The work must be provided as a whole and at no more than a reasonable one-time reproduction cost, and should be downloadable via the Internet without charge." (highlighting is mine). See also a discussion in this blog post: Why are bulk downloads of open data important?
Having both options in a single API is an improvement, but I would suggest making the enclosure link optional, while "7.2.2. INSPIRE-specific requirement" seems to make it mandatory.
If an implementation of OAPIF does not provide a mechanism to get a bulk download of the dataset served, that implementation cannot be an INSPIRE-compliant download service. The enclosure construction is one mechanism to provide a bulk download.
Case in point: the dataset might be large and already stored in a database. Implementing an enclosure link would then require preparing a full dataset dump in one or more formats, increasing space utilization.
The idea would be that a data provider provides a link to an already prepared dataset dump, that resides on another server. That could e.g. be an FTP-server, that has dataset dumps on it prepared by another tool, e.g. FME reading data from a database and uploading them every night.
See the example from the OGCAPIF spec:
The example in the current version of the Good Practice does not reflect this, I guess that should be updated to e.g. the following:
So the consequence for a server supporting OGCAPI would be that it must support adding an "enclosure" link, including a type and preferably also a title and length. It would not have to add the possibility to prepare a bulk download itself.
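As a rough sketch of what "adding an enclosure link" could amount to on the server side, the following builds the link object for a dump that some other tool has already prepared. The function name `enclosure_link`, the temporary file and the title text are assumptions made for illustration; only the members href, rel, type, title and length come from the discussion.

```python
import os
import tempfile


def enclosure_link(href: str, path: str, media_type: str, title: str) -> dict:
    """Describe an already prepared dump file as an "enclosure" link."""
    return {
        "href": href,
        "rel": "enclosure",
        "type": media_type,
        "title": title,
        "length": os.path.getsize(path),  # size of the dump in bytes
    }


# Stand-in for a dump produced elsewhere, e.g. by a nightly export job.
with tempfile.NamedTemporaryFile(suffix=".gpkg", delete=False) as dump:
    dump.write(b"\x00" * 1024)

link = enclosure_link(
    href="http://my-org.eu/buildings.gpkg",
    path=dump.name,
    media_type="application/geopackage+sqlite3",
    title="Bulk download of the full data set (GeoPackage)",
)
os.remove(dump.name)  # clean up the illustrative file
```

The server only reads the size of an existing file; producing the dump stays outside the API implementation.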
See e.g. the following architecture:
Thanks for the update. I maintain the point that the requirement is taxing, but the INSPIRE directive is free to ask for whatever seems fit, and those implementing it will simply have to deal with the cost of it.
If an implementation of OAPIF does not provide a mechanism to get a bulk download of the dataset served, that implementation cannot be an INSPIRE-compliant download service.
That depends on the interpretation of "operation" / "request" / "response" in the legal text. If the terms are interpreted in the HTTP 1.1 sense, the statement above would be correct, but if a more general definition is used and iterating over the next links would be considered as part of the operation / collecting the response, then there would be no requirement for an enclosure link. It may still make sense to provide such a link, but providing the API would be sufficient for compliance.
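For readers unfamiliar with the paging alternative, the sketch below shows what "iterating over the next links" means for a client. The `fetch` callable and the two-page `PAGES` dictionary are hypothetical; a real client would issue HTTP requests instead of dictionary lookups.

```python
from typing import Callable, Iterator, Optional


def next_href(page: dict) -> Optional[str]:
    """Return the href of the rel="next" link, or None on the last page."""
    for link in page.get("links", []):
        if link.get("rel") == "next":
            return link.get("href")
    return None


def iter_all_features(start_url: str, fetch: Callable[[str], dict]) -> Iterator[dict]:
    """Follow "next" links from the first items page until none is left."""
    url: Optional[str] = start_url
    while url is not None:
        page = fetch(url)
        yield from page.get("features", [])
        url = next_href(page)


# Hypothetical paged responses, keyed by request URL.
PAGES = {
    "http://my-org.eu/collections/buildings/items": {
        "features": [{"id": 1}, {"id": 2}],
        "links": [
            {"rel": "next",
             "href": "http://my-org.eu/collections/buildings/items?offset=2"}
        ],
    },
    "http://my-org.eu/collections/buildings/items?offset=2": {
        "features": [{"id": 3}],
        "links": [],
    },
}

ids = [
    feature["id"]
    for feature in iter_all_features(
        "http://my-org.eu/collections/buildings/items", PAGES.__getitem__
    )
]
```

Whether such a loop counts as a single "Get Spatial Data Set" operation is exactly the interpretation question raised above.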
Preparing a bulk download for data that is a static snapshot and never changes may not be an issue in general, but for data that changes (and I think that is a large part of the datasets) having a bulk download that always has the up-to-date data with the QoS requirements will often be challenging and costly plus the downloaded data will in general be outdated. I would find it surprising, if that was really the intent of the legal text.
A clarification: the quote is not complete. There was an extra sentence there (highlighted here):
If an implementation of OAPIF does not provide a mechanism to get a bulk download of the dataset served, that implementation cannot be an INSPIRE-compliant download service. The enclosure construction is one mechanism to provide a bulk download.
Paging would indeed be a second possible mechanism. Paging was also discussed earlier in the discussion paper, and it has been discussed in the context of WFS, see Download Service WFS: StoredQuery and ResponsePaging for large datasets?. For large datasets this seems to be far from ideal (though probably compliant?).
Then we're back in the discussion regarding the use of INSPIRE: do we implement it to be compliant ("tick the box") or do we implement it because it makes sense to provide a spatial data infrastructure giving access to our data.
From the document:
The response of the /collections operation SHALL include at least one enclosure link that allows requesting a representation of the whole data set.
This sounds like a mandatory presence, while "one mechanism to provide bulk download" makes it sound like it would be one among other various valid options (hence, optional). If the enclosure can just point to the "items" resource and paging is allowed, then the issue is solved... however, any client can already download the full dataset that way, as part of the OGC Features API, no need to have an explicit "enclosure" link for it.
@heidivanparys - Just to clarify: I omitted the additional sentence as I don't think it is important for the discussion whether the legal text requires a "bulk download" (single downloadable file) for the Get Spatial Data Set operation or not, which was the subject of my comment.
If indeed there would be consensus that a paged download would meet the legal requirements then there would be no need for a requirement related to a bulk download and an "enclosure" link could just be in an example like on the OGC API Features standard.
If indeed there would be consensus that a paged download would meet the legal requirements then there would be no need for a requirement related to a bulk download
This would be the best solution for providers of APIs if we'd like to avoid INSPIRE-specific requirements as much as possible.
I do see the use case of a bulk download, especially for users of an API, so what about making an enclosure link for bulk downloads at least a recommendation (if paging would meet the legal requirements)? This could also be an easy way to serve the data in different formats (like gpkg) for example. Or in other CRSes, for convenience.
@thijsbrentjens the "other CRSs" bit piqued my interest. How does one advertise the CRS of the specific enclosure link? Is there a machine processable way, or would this be delegated to the link title, and hence, to human interpretation?
how does one advertise the CRS of the specific enclosure link? Is there a machine processable way, or would this be delegated to the link title, and hence, to human interpretation?
@aaime you've got me there, I didn't think this through / check it. There does not seem to be a machine processable way indeed.
@AlexRamageScot1 - What would you consider to be the "mainstream" link relation type for a downloadable distribution of a dataset? Could you provide good examples where that link relation type is used? We use enclosure in the OGC API Features examples as this is the only applicable registered relation type that we have identified and I couldn't find another link relation type to reference a bulk download file that is widely used. If we missed something, we should also update the examples in the standard.
@cportele This was actually my comment - I didn't intend to imply there was a more mainstream link relation (although it seems to have generated some good discussion anyway), it was more whether enclosure is a readily understandable term for a bulk download just from a developer experience point of view. "Download" clearly has issues as a term both for INSPIRE and more generally, and "bulk download" might be more understandable but isn't registered.
So I can't say that I've got a better suggestion (which I acknowledge isn't helpful!), I was just concerned that if I was coding against a service I wouldn't guess that a link called enclosure would lead to a full bulk download.
Perhaps the guidance could recommend that the title of an enclosure link specifically makes clear it's a bulk download of the full dataset?
I had a look at the service that is provided by pygeoapi community for the compliance test [1], https://demo.pygeoapi.io/cite , and there "download" is used:
[1] See https://www.opengeospatial.org/resource/products?display_opt=1&specid=1022 .
I would avoid rel: "download" without registration of the link relation type, but how about using rel: "https://schema.org/downloadUrl" as an alternative?
This might be more intuitive, uses a commonly known vocabulary, has a URI that can be dereferenced to find more information about the semantics and is at least broadly in line with the rules for Extension Relation Types in RFC 8288.
I would avoid rel: "download" without registration of the link relation type, but how about using rel: "https://schema.org/downloadUrl" as an alternative?
https://schema.org/downloadUrl is supposed to be used for software applications:
How about using rel: "http://www.w3.org/ns/dcat#downloadurl"? (lowercase, thus not http://www.w3.org/ns/dcat#downloadURL, in accordance with the recommendation in RFC 8288).
When extension relation types are compared, they MUST be compared as
strings (after converting to URIs if serialised in a different
format) in a case-insensitive fashion, character by character.
Because of this, all-lowercase URIs SHOULD be used for extension
relations.
http://www.w3.org/ns/dcat#downloadURL is more specific than https://schema.org/contentUrl, see also https://www.w3.org/TR/vocab-dcat-2/#dcat-sdo (both dcat:accessURL and dcat:downloadURL map to sdo:contentUrl):
dcat:downloadURL
schema:domainIncludes dcat:Distribution , schema:DataDownload ;
schema:rangeIncludes rdfs:Resource , schema:url ;
rdfs:subPropertyOf schema:contentUrl ;
DCAT is widely used in government portals and is a well-known specification, see e.g. Building a search engine for datasets in an open Web ecosystem (DOI 10.1145/3308558.3313685):
While Schema.org is widely used by search engines and other applications
to improve many Web-based tools that need to rely on
semantics of the data on a Web page, it is not the only open standard
for describing dataset metadata. Several other standards exist, most
notably, the W3C Data Catalog Vocabulary (DCAT) [4]. Mappings
between Schema.org and the various extensions to DCAT are currently
under discussion at W3C and Schema.org. We found that at
the moment, only 2% of dataset descriptions (in JSON-LD, RDFa or
Microdata) use the DCAT standard while the rest use Schema.org.
However, the datasets that use the DCAT standard include hundreds
of thousands of datasets from government portals around the
world, and in particular portals with geo-spatial data. Therefore,
to get better coverage and to be inclusive of other standards, we
process both Schema.org and DCAT metadata, as long as the latter
is also represented syntactically in a supported syntax, to allow the
regular crawl processing to extract the triples (Section 5.7).
One more argument: given the effort the EU has put into developing DCAT-AP, the re-use of elements of DCAT within INSPIRE would also be good, I think.
❓ It is not clear that this is the download URL for another distribution of the dataset. Is that an issue?
This might be more intuitive, uses a commonly known vocabulary, has a URI that can be dereferenced to find more information about the semantics and is at least broadly in line with the rules for Extension Relation Types in RFC 8288.
I like that approach.
❓ @cportele What exactly do you mean with "at least broadly in line"? Where would the approach not be compliant?
https://schema.org/downloadUrl is supposed to be used for software applications
Good catch, I overlooked this.
What exactly do you mean with "at least broadly in line"? Where would the approach not be compliant?
I said that because:
- The schema.org URI is not "under the control of the person or party defining it or be delegated to them".
- I did not use all lowercase, because the schema.org URL is not all lowercase.
For reference, the whole relevant clause from the IETF Proposed Standard RFC 8288:
2.1.2. Extension Relation Types
Applications that don't wish to register a relation type can use an
extension relation type, which is a URI [RFC3986] that uniquely
identifies the relation type. Although the URI can point to a
resource that contains a definition of the semantics of the relation
type, clients SHOULD NOT automatically access that resource to avoid
overburdening its server.
The URI used for an extension relation type SHOULD be under the
control of the person or party defining it or be delegated to them.
When extension relation types are compared, they MUST be compared as
strings (after converting to URIs if serialised in a different
format) in a case-insensitive fashion, character by character.
Because of this, all-lowercase URIs SHOULD be used for extension
relations.
Note that while extension relation types are required to be URIs, a
serialisation of links can specify that they are expressed in another
form, as long as they can be converted to URIs.
And according to the IETF Best Current Practice BCP 14:
SHOULD This word, or the adjective "RECOMMENDED", mean that there
may exist valid reasons in particular circumstances to ignore a
particular item, but the full implications must be understood and
carefully weighed before choosing a different course.
So the proposal would be to use the extension relation type http://www.w3.org/ns/dcat#downloadURL, using the case as defined in DCAT. Hence the URI would not be all-lowercase.
The reason to ignore the recommendation of IETF would be that we follow Best Practice 15: Reuse vocabularies, preferably standardized ones, in order to increase interoperability.
Are there any implications? We could repeat the requirement that extension relation types must be compared in a case-insensitive fashion.
The example would then become:
{
  "links": [
    { "href": "http://my-org.eu/collections.json",
      "rel": "self", "type": "application/json", "title": "this document" },
    { "href": "http://my-org.eu/buildings.gpkg",
      "rel": "http://www.w3.org/ns/dcat#downloadURL",
      "type": "application/geopackage+sqlite3",
      "title": "Pre-defined data set download (GeoPackage)" }
  ]
}
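To make the implication of a mixed-case extension relation type concrete, here is a small sketch of how a client might look up such a link while comparing relation types case-insensitively, as RFC 8288 requires for extension relation types. The helper names `same_relation` and `find_links` are made up for this illustration.

```python
from typing import List

DOCUMENT = {
    "links": [
        {"href": "http://my-org.eu/collections.json",
         "rel": "self", "type": "application/json", "title": "this document"},
        {"href": "http://my-org.eu/buildings.gpkg",
         "rel": "http://www.w3.org/ns/dcat#downloadURL",
         "type": "application/geopackage+sqlite3",
         "title": "Pre-defined data set download (GeoPackage)"},
    ]
}


def same_relation(a: str, b: str) -> bool:
    """Compare two relation types as strings, ignoring case."""
    return a.lower() == b.lower()


def find_links(document: dict, rel: str) -> List[dict]:
    """Return all links whose relation type matches rel, ignoring case."""
    return [
        link for link in document.get("links", [])
        if same_relation(link.get("rel", ""), rel)
    ]


# The all-lowercase spelling still finds the link written with DCAT's casing.
downloads = find_links(DOCUMENT, "http://www.w3.org/ns/dcat#downloadurl")
```

A client that compares with plain string equality would miss the link, which is the practical risk of not using an all-lowercase URI.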
❓ Then the next question is, what about the length target attribute?
2.2. Target Attributes
Target attributes are a list of key/value pairs that describe the
link or its target; for example, a media type hint.
They can be defined both by individual link relation types and by
link serialisations.
This specification does not attempt to coordinate the name of target
attributes, their cardinality, or use. Those creating and
maintaining serialisations SHOULD coordinate their target attributes
to avoid conflicts in semantics or syntax and MAY define their own
registries of target attributes.
The names of target attributes SHOULD conform to the token rule, but
SHOULD NOT include any of the characters "%", "'", or "*", for
portability across serialisations and MUST be compared in a case-
insensitive fashion.
Target attribute definitions SHOULD specify:
o The serialisation of their values into Unicode or a subset
thereof, to maximise their chances of portability across link
serialisations.
o The semantics and error handling of multiple occurrences of the
target attribute on a given link.
This specification does define target attributes for use in the Link
HTTP header field in Section 3.4.
The length target attribute goes hand in hand with the enclosure link relation type, so I guess it should not be used?
Given the use of http://www.w3.org/ns/dcat#downloadURL, a natural choice could be the use of http://www.w3.org/ns/dcat#byteSize. But if I understand the "token rule" referred to in RFC 8288 correctly (see also RFC 7230), URIs cannot be used as target attribute names. Could byteSize be used, given that the context is http://www.w3.org/ns/dcat#?
{
  "links": [
    { "href": "http://my-org.eu/collections.json",
      "rel": "self", "type": "application/json", "title": "this document" },
    { "href": "http://my-org.eu/buildings.gpkg",
      "rel": "http://www.w3.org/ns/dcat#downloadURL",
      "type": "application/geopackage+sqlite3",
      "title": "Pre-defined data set download (GeoPackage)",
      "byteSize": 472546 }
  ]
}
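The token question can be checked mechanically. The sketch below encodes the token rule from RFC 7230 (one or more tchar characters) as a regular expression and adds the RFC 8288 advice to avoid "%", "'" and "*". The function names are hypothetical, and this only tests the syntax of a name, not whether it is a sensible choice.

```python
import re

# token = 1*tchar, with tchar as listed in RFC 7230, section 3.2.6.
TOKEN = re.compile(r"[!#$%&'*+\-.^_`|~0-9A-Za-z]+")

# Characters that RFC 8288 says target attribute names SHOULD NOT include.
DISCOURAGED = set("%'*")


def is_token(name: str) -> bool:
    """True if name consists only of tchar characters and is not empty."""
    return TOKEN.fullmatch(name) is not None


def is_portable_attribute_name(name: str) -> bool:
    """True if name is a token and avoids the discouraged characters."""
    return is_token(name) and not (set(name) & DISCOURAGED)


candidates = ["length", "byteSize", "http://www.w3.org/ns/dcat#byteSize"]
results = {name: is_portable_attribute_name(name) for name in candidates}
```

Under this reading, byteSize is a syntactically acceptable target attribute name, while the full DCAT URI is not, because ":" and "/" are not tchar characters.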
The link relation type enclosure is also described at http://microformats.org/wiki/rel-enclosure . Here the description is the following:
By adding rel="enclosure" to a hyperlink, a page indicates that the destination of that hyperlink is intended to be downloaded and cached.
Right now, I'm inclined to say that we should stick to enclosure and add a recommendation that the title of an enclosure link should make it clear that the link target is a bulk download of the full dataset. @AlexRamageScot1 @MichaelGordon What do you think?
Regarding some earlier comments:
how does one advertise the CRS of the specific enclosure link? Is there a machine processable way, or would this be delegated to the link title, and hence, to human interpretation?
@aaime you've got me there, I didn't think this through / check it. There does not seem to be a machine processable way indeed.
According to RFC 8288 and also according to the IETF Internet-Draft: JSON serialization for Web Linking, additional target attributes may be added, see also the comment above.
So if I understand that correctly, we could define a crs target attribute in INSPIRE:
{
  "links": [
    { "href": "http://my-org.eu/collections.json",
      "rel": "self", "type": "application/json", "title": "this document" },
    { "href": "http://my-org.eu/buildings.gpkg",
      "rel": "enclosure",
      "type": "application/geopackage+sqlite3",
      "title": "Pre-defined data set download (GeoPackage)",
      "length": 472546,
      "crs": "http://www.opengis.net/def/crs/EPSG/0/3044" }
  ]
}
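If such a crs target attribute were adopted, a client could select a download without parsing link titles. The following is only a sketch under that assumption: the crs member is the proposal from this thread, not an existing standard attribute, and the second enclosure link (ETRS89, EPSG:4258) and the function `pick_enclosure` are invented for the example.

```python
from typing import List, Optional

LINKS = [
    {"href": "http://my-org.eu/collections.json",
     "rel": "self", "type": "application/json", "title": "this document"},
    {"href": "http://my-org.eu/buildings.gpkg",
     "rel": "enclosure",
     "type": "application/geopackage+sqlite3",
     "title": "Pre-defined data set download (GeoPackage)",
     "length": 472546,
     "crs": "http://www.opengis.net/def/crs/EPSG/0/3044"},
    # Hypothetical second distribution of the same data set in another CRS.
    {"href": "http://my-org.eu/buildings-4258.gpkg",
     "rel": "enclosure",
     "type": "application/geopackage+sqlite3",
     "title": "Pre-defined data set download (GeoPackage, ETRS89)",
     "crs": "http://www.opengis.net/def/crs/EPSG/0/4258"},
]


def pick_enclosure(links: List[dict], crs: str,
                   media_type: Optional[str] = None) -> Optional[dict]:
    """Return the first enclosure link advertising the requested CRS."""
    for link in links:
        if link.get("rel") != "enclosure":
            continue
        if link.get("crs") != crs:
            continue
        if media_type is not None and link.get("type") != media_type:
            continue
        return link
    return None


chosen = pick_enclosure(LINKS, "http://www.opengis.net/def/crs/EPSG/0/4258")
```

Links without a crs member are simply never selected, so existing enclosure links would keep working for clients that do not care about the CRS.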
@aaime @thijsbrentjens What do you think?
@heidivanparys Adding the crs attribute works for me.
I'm re-reading this again and found the situation is worse than I initially imagined.
The requirement is to allow download of the entire dataset (potentially multiple collections) with a single link. However, the allowable return types are just the following: http://inspire.ec.europa.eu/media-types/application
The only true multi-collection formats I see there are:
- the geopackage, which needs to be fully constructed on disk before sending the response back,
- file-gdb, same as above
- oracle-dump
- GML, which allows multiple collections to be included in the same XML dump.
It is my understanding that GeoJSON tools prefer to have a single collection per document instead. Was hoping to just return a ZIP file with one GeoJSON file entry per collection, but don't see zip as an acceptable response type.
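The "ZIP file with one GeoJSON file entry per collection" idea could look roughly like the sketch below, which uses Python's standard zipfile module and builds the archive in memory. The collection names, the sample features and the `.geojson` entry naming are assumptions for illustration, not something the registry or this thread prescribes.

```python
import io
import json
import zipfile
from typing import Dict, List


def zip_collections(collections: Dict[str, List[dict]]) -> bytes:
    """Pack each collection as its own GeoJSON entry in one ZIP archive."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for name, features in collections.items():
            document = {"type": "FeatureCollection", "features": features}
            archive.writestr(f"{name}.geojson", json.dumps(document))
    return buffer.getvalue()


def point(x: float, y: float) -> dict:
    # Small helper producing a GeoJSON point feature for the example data.
    return {
        "type": "Feature",
        "geometry": {"type": "Point", "coordinates": [x, y]},
        "properties": {},
    }


archive_bytes = zip_collections({
    "buildings": [point(10.0, 56.0), point(10.1, 56.1)],
    "addresses": [point(10.2, 56.2)],
})

# Reading the archive back shows one entry per collection.
with zipfile.ZipFile(io.BytesIO(archive_bytes)) as archive:
    entry_names = sorted(archive.namelist())
```

Note that the archive is still assembled completely before it can be served, so this does not remove the need to prepare the package ahead of the request.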
So, in the current situation, it seems that if one cannot prepare in advance a static download package, the best option for full dataset download would be GML... not the direction I was expecting :-D
Please clarify? As suggested previously, the preparation of a static download package is not always feasible.
Please clarify?
I think the simple explanation is that the contents of the registry is dynamic and should be updated when a need arises.
I tried to look up what the media type for a "ZIP file with one GeoJSON file entry per collection" could be and found the following:
- section 3.6 in RFC 6839, Additional Media Type Structured Syntax Suffixes
- How to refer to media types within ZIP files?
- Question on the usage of MimeType and MediaType for "zipped/container" distributions
On RFC 6839: it has status "informational", and is updated by RFC 7303, XML Media Types, a proposed standard, that does not say anything about zip, so I assume that what is said about zip in RFC 6839 is still useful. Again, it is "informational", but probably the best we have to refer to?
So application/geo+json+zip should be added to the registry, following the above. @aaime @alexanderkotsev @cportele What do you think? Is this a good solution?
If yes, then I am actually wondering, where does application/x-gmz come from? Is that defined by OGC?
I'm re-reading this again and found the situation is worse than I initially imagined.
Does this only refer to missing media types or do we have other outstanding issues? To be honest, I had the impression we were getting quite close to a workable and standardised solution 😕 .
@alexanderkotsev Should we maybe organize a telecon to try to finalise this issue?
@heidivanparys - A few comments:
- 7303 updates section 4.1 of 6839, it does not update the +zip section, so the +zip section is still up-to-date.
- In my understanding 6839 states rules for media types with a suffix like "+zip". It does not say a suffix "+zip" may be added to any existing media type. Something like application/geo+json+zip would not be a valid media type. It would still need to be registered with IANA. Without registering specific media types with IANA, any solution will always be an INSPIRE-specific convention. But this is ok and it has always been the case, see all the registrations with the "x-" prefix to indicate an unregistered media type.
- Google had registered application/vnd.google-earth.kml+xml for a KML document and application/vnd.google-earth.kmz for a zipped version. There was a discussion at some point whether OGC should do something similar for zipped GML documents (application/gmz). OGC never registered such a media type, but INSPIRE needed such a type, because of the interpretation that the legal text requires a file download for each dataset. That's why application/x-gmz was specified at that time.
@heidivanparys by "much worse" I mean the requirement to allow download of an entire dataset (multiple collections) instead of each single collection independently, and the associated issues.
A json+zip approach seems workable indeed. Not sure what's the benefit of downloading the entire dataset though, when one might need only one collection out of many? Is it just convenience for those that "want it all"?
@cportele Ok, thanks for the comments.
Without registering specific media types with IANA, any solution will always be an INSPIRE-specific convention. But this is ok and it has always been the case, see all the registrations with the "x-" prefix to indicate an unregistered media type.
I'm not sure that INSPIRE should just invent a new media type here. Wouldn't it be best to ask for the advice of OGC on this matter? Would the Naming Authority be the right place?
I'm not sure that INSPIRE should just invent a new media type here.
My point was that INSPIRE has been doing this from the beginning. Yes, it is not perfect, but has it not been sufficient so far? The alternative would be that INSPIRE registers the additional media types in the vnd branch, which should be straightforward.
Wouldn't it be best to ask for the advice of OGC on this matter? Would the Naming Authority be the right place?
Yes, in OGC the Naming Authority is now responsible for registering media types needed for OGC standards (which would exclude, e.g., zipped GeoJSON, I think).
@heidivanparys by "much worse" I mean the requirement to allow download of an entire dataset (multiple collections) instead of each single collection independently, and the associated issues.
A json+zip approach seems workable indeed. Not sure what's the benefit of downloading the entire dataset though, when one might need only one collection out of many? Is it just convenience for those that "want it all"?
- It is a requirement in INSPIRE that an entire dataset can be downloaded at once. However, ...
- ... it is actually a requirement that makes sense according to the following sources:
a. As also mentioned in the good practice document, this is in line with the recommendation Best Practice 17: Provide bulk download. Arguments for why this is a recommendation are provided there.
b. As also mentioned in the good practice document, this is in line with what is described in Why are bulk downloads of open data important?. As the title indicates, arguments for why this is important are provided there.
c. In opengeospatial/ogcapi-features#230, offering a bulk download did not seem to be perceived as a crazy idea. See e.g. this reaction: opengeospatial/ogcapi-features#230 (comment)
- One could offer a download of the entire dataset (multiple collections) and a download of each single collection; nothing prevents you from doing that to make it more user-friendly.
- Those that "want it all" would probably import the data in their own tool. So the main target group is probably not Web developers, but the "end users" of the data, those actually performing analysis on the data.
@thijsbrentjens You also expressed that you could see the use case for this, do you maybe have anything to add here?
@heidivanparys the way Chris Holmes suggested it, including asynch behavior, would make it feasible too, and open the door for geopackage, shapefiles and the like, outside of the case where the data is mostly static, or infrequently updated anyways.
As said during the meeting of today, resolving and closing this issue would be my first priority.
GitHub-label "waiting for input"
I tried to break this issue into several issues that address one topic, see the mentioned issues above.