
Bulk download: all collections at once and/or per collection?


This issue addresses one of the topics discussed in #22.

Question: would offering a bulk download for each collection be sufficient, instead of offering a bulk download that contains all the features of the dataset?

Input to the discussion is provided in #22, starting from #22 (comment), and more input is probably needed.

What would be the arguments from an implementation-centric point of view to prefer a bulk download per collection?

As for GeoJSON, the following was already mentioned:

It is my understanding that GeoJSON tools prefer to have a single collection per document instead.

In addition:

GeoJSON defines Feature collections and Features, but does not contemplate the possibility of defining feature types or associating a Feature with a feature type.

See also sections 5.2 and 5.3 of the OGC Testbed-12 JSON and GeoJSON User Guide.
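
To make that limitation concrete, here is a minimal sketch in Python. The "featureType" foreign member is hypothetical; it only illustrates the kind of information GeoJSON itself does not provide a standard member for:

```python
import json

# A minimal GeoJSON FeatureCollection (RFC 7946). GeoJSON has no standard
# member naming a feature type; "featureType" below is a hypothetical
# foreign member, shown only to illustrate the gap.
feature_collection = {
    "type": "FeatureCollection",
    "features": [
        {
            "type": "Feature",
            "geometry": {"type": "Point", "coordinates": [5.1, 52.0]},
            "properties": {"name": "station-1"},
            # GeoJSON itself cannot say which feature type this Feature
            # belongs to; publishers resort to foreign members like this one.
            "featureType": "EnvironmentalMonitoringFacility",
        }
    ],
}

print(json.dumps(feature_collection, indent=2))
```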

Would we have different requirements/recommendations depending on the format used?

Offering a bulk download per collection would require more "clicks" from the user to obtain the whole dataset. The number of clicks is known, though, as all collections are described in the response of /collections (see the sketch below).
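
For illustration, a minimal client sketch (the service root is an assumption). The loop issues exactly one items request per collection listed by /collections:

```python
import requests

BASE = "https://example.org/api"  # hypothetical service root

# /collections enumerates every collection, so the number of requests needed
# to fetch the whole dataset collection-by-collection is known upfront.
resp = requests.get(f"{BASE}/collections", headers={"Accept": "application/json"})
resp.raise_for_status()

for coll in resp.json()["collections"]:
    # One bulk request per collection; real services page their results, and
    # the large limit here only keeps the sketch short.
    items = requests.get(
        f"{BASE}/collections/{coll['id']}/items", params={"limit": 10000}
    ).json()
    print(coll["id"], items.get("numberReturned"))
```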

More input to the discussion: providing subsets for large datasets is discussed in Data on the Web Best Practices and in Spatial Data on the Web Best Practices.

Two extracts:

Making data available on the Web requires data publishers to provide some form of access to the data. There are numerous mechanisms available, each providing varying levels of utility and incurring differing levels of effort and cost to implement and maintain. Publishers of spatial data should make their data available on the Web using affordable mechanisms to ensure long-term, sustainable access to their data.

When determining the mechanism to be used to provide Web access to data, publishers need to assess utility against cost. In order of increasing usefulness and cost:

Bulk-download or streaming of the entire or pre-defined subsets of a dataset
Generalized spatial data access API
Bespoke API designed to support a particular type of use

Let's take a closer look at these options.

The download of a dataset - or a pre-defined subset of it - via a single HTTP request is mainly covered by these [DWBP] best practices:

[DWBP] Best Practice 17: Provide bulk download,
[DWBP] Best Practice 18: Provide Subsets for Large Datasets, and
[DWBP] Best Practice 19: Use content negotiation for serving data available in multiple formats.

Providing bulk-download or streaming access to data is useful in any case and is relatively inexpensive to support as it relies on standard capabilities of Web servers for datasets that may be published as downloadable files stored on a server. However, this option is more complex for frequently changing datasets or real-time data.
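
A hedged sketch of what such a single-request bulk download looks like from the client side, combining [DWBP] Best Practices 17 and 19 (the download URL and media type are assumptions):

```python
import requests

# Hypothetical pre-defined download URL; in practice it is advertised by the
# service, e.g. as a link on the landing page or on a collection.
DOWNLOAD_URL = "https://example.org/downloads/dataset.gpkg"

# [DWBP] Best Practice 19: request a specific representation via content
# negotiation rather than a format-specific URL.
headers = {"Accept": "application/geopackage+sqlite3"}

# Streaming keeps client memory use flat even for very large files, which is
# part of what makes plain file-based bulk download cheap to support.
with requests.get(DOWNLOAD_URL, headers=headers, stream=True, timeout=60) as resp:
    resp.raise_for_status()
    with open("dataset.gpkg", "wb") as out:
        for chunk in resp.iter_content(chunk_size=1 << 20):
            out.write(chunk)
```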

[DWBP] Best Practice 18: Provide Subsets for Large Datasets explains why providing subsets is important and how this could be implemented. Spatial datasets, particularly coverages such as satellite imagery, sensor measurement time-series and climate prediction data, are often very large. In these cases, it is useful to provide subsets by having identifiers for conveniently sized subsets of large datasets that Web applications can work with.

Best Practice 18: Provide Subsets for Large Datasets

If your dataset is large, enable users and applications to readily work with useful subsets of your data.
Why

Large datasets can be difficult to move from place to place. It can also be inconvenient for users to store or parse a large dataset. Users should not have to download a complete dataset if they only need a subset of it. Moreover, Web applications that tap into large datasets will perform better if their developers can take advantage of “lazy loading”, working with smaller pieces of a whole and pulling in new pieces only as needed. The ability to work with subsets of the data also enables offline processing to work more efficiently. Real-time applications benefit in particular, as they can update more quickly.
Intended Outcome

Humans and applications will be able to access subsets of a dataset, rather than the entire thing, with a high ratio of needed to unneeded data for the largest number of users. Static datasets that users in the domain would consider to be too large will be downloadable in smaller pieces. APIs will make slices or filtered subsets of the data available, the granularity depending on the needs of the domain and the demands of performance in a Web application.
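
In OGC API - Features terms, this kind of subsetting maps directly onto the core query parameters bbox, datetime and limit. A minimal sketch, assuming a hypothetical service root and collection id:

```python
import requests

BASE = "https://example.org/api"  # hypothetical service root

# OGC API - Features core already provides this kind of subsetting: clients
# request a spatial/temporal slice and page through it instead of pulling
# the whole dataset. The "observations" collection id is an assumption.
params = {
    "bbox": "5.0,52.0,6.0,53.0",  # spatial subset (WGS 84 lon/lat)
    "datetime": "2020-01-01T00:00:00Z/2020-01-31T23:59:59Z",  # temporal subset
    "limit": 1000,  # page size
}
page = requests.get(f"{BASE}/collections/observations/items", params=params).json()
print(page.get("numberReturned"), "features in this page")
```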

Question: would offering a bulk download for each collection be sufficient, instead of offering a bulk download that contains all the features of the dataset?

Situations could also occur where one collection contains only a few features, e.g. a collection of sampling features (such as stations), while another collection contains many features, e.g. observations. In that case, perhaps an external bulk download of only the observations would be sufficient, and the user could retrieve the stations via a single request to /collections/stations/items?
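
Such a mixed strategy could look roughly like this (service root and collection ids are assumptions):

```python
import requests

BASE = "https://example.org/api"  # hypothetical service root

# Small collection: a single items request is enough (ids are assumptions).
stations = requests.get(
    f"{BASE}/collections/stations/items", params={"limit": 10000}
).json()
print(len(stations["features"]), "stations fetched in one request")

# Large collection: follow the pre-defined bulk download advertised on the
# collection (here as an "enclosure" link) instead of paging through items.
obs = requests.get(f"{BASE}/collections/observations").json()
bulk = next(link for link in obs["links"] if link.get("rel") == "enclosure")
print("bulk download of observations at:", bulk["href"])
```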

This is addressed in /req/pre-defined/enclosure.
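
For reference, a collection advertising such a pre-defined download might look roughly like the following sketch; all values are hypothetical:

```python
# Sketch of a collection description advertising a pre-defined bulk download
# as an "enclosure" link, in the spirit of /req/pre-defined/enclosure.
# All values are hypothetical.
collection = {
    "id": "observations",
    "title": "Observations",
    "links": [
        {
            "href": "https://example.org/downloads/observations.gpkg",
            "rel": "enclosure",
            "type": "application/geopackage+sqlite3",
            "title": "Bulk download of the observations collection",
            "length": 123456789,  # size in bytes, so clients can decide upfront
        }
    ],
}
```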