w3c/dxwg

How to express distributions provided as compressed files

nichtich opened this issue · 20 comments

The dcat:downloadURL of a dcat:Distribution can point to compressed files (.zip, .gz...). What is the data format in this case? Stating that it is a ZIP-File will not help much: the more interesting information is what's in the archive. For dcat:mediaType the +gzip MIME TYPE suffix can be added but what to put into dct:format - an identifier of the archive format or of its content (I'd prefer the latter)?

I think you're asking a bit much here! If there are a bunch of different things all zipped together (say PDF files, CSV and images), then what are you going to do if you want to communicate the content? Provide a list of media types like this: application/pdf,text/csv,image/png? How is this helpful for machine purposes? A written inventory of dataset content would do for a person (in the Distribution's dct:description).

dcat:mediaType should be used with a Media Type to indicate the compression type, unless the compression type itself doesn't have an IANA Media Type (a new BZip format or something), in which case dct:format should be used.

See also: #54

@akuckartz thanks for the reminder about #54 which I recall now that you mention it!

Is this issue a duplicate? I think it is. Or, at least, the scope of #54 covers this issue. I move to close this issue in favour of dealing with compression issues in #54.

@nicholascar I actually think there are 2 separate issues plus their combination, but even in #54 they are a bit mixed up.

  1. Pure compression, e.g. I have an RDF TriG Distribution (one .trig file), I compress it to save space on my web server, creating a .trig.gz file, and I want to be able to describe this distribution properly. This can be done in multiple ways combining web server techniques and DCAT. One way could be to serve the .trig.gz file directly; then I need to be able to say in DCAT that the distribution is RDF TriG with Media type application/trig and that it is compressed using gzip with Media type application/gzip. There is the Media type extension +zip, which is however not specific enough (zip and gzip are two different things). Another way of doing this is saying that we leave compression to HTTP server and client (restriction to HTTP), the server can use gzip_static to serve the .trig.gz file from its file system and the client decompresses it transparently in the HTTP layer. This means the Distribution still points to the .trig file, the media type is application/trig, and it is completely opaque to the user.
  2. Packaging of files (like tar does). This is a separate use case where a set of files (homogeneous or heterogeneous) is packaged into one. The question here is whether we recommend this or not, and if we do, how do we describe what is inside. Again, one way would be to say that we only recommend homogeneous packages (e.g. a set of .xml files valid against a single XSD), and provide properties for saying that the file inside is application/xml and the package is TAR (there is no official Media type for that, only the unofficial application/x-tar, and there is a file type for it). I would disallow (not recommend) having a package of heterogeneous files as one distribution, and recommend splitting them into multiple distributions, so that each can be described properly.
  3. Combination of these two. There should be guidance for this, e.g. a .tar.gz file containing a set of XML files. There we need to be able to describe that they are XML files conformant to an XSD schema, packaged using TAR and compressed using gzip.
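
To make the three cases concrete, here is a minimal Python sketch that peels the compression and packaging layers off a file name. The extension tables and the function name split_layers are my own illustrative assumptions, not taken from any registry or from DCAT:

```python
# Hypothetical extension tables for illustration; not an exhaustive registry.
COMPRESSION = {"gz": "gzip", "bz2": "bzip2", "xz": "xz"}
PACKAGING = {"tar": "tar"}
BOTH = {"zip": "zip"}  # ZIP packages and compresses at once


def split_layers(filename):
    """Return (inner_extension, package, compression) for a file name."""
    parts = filename.lower().split(".")
    package = compression = None
    # Peel suffixes from the right: compression first, then packaging.
    if len(parts) > 1 and parts[-1] in BOTH:
        package = compression = BOTH[parts.pop()]
    else:
        if len(parts) > 1 and parts[-1] in COMPRESSION:
            compression = COMPRESSION[parts.pop()]
        if len(parts) > 1 and parts[-1] in PACKAGING:
            package = PACKAGING[parts.pop()]
    # Whatever extension is left (if any) is the inner format.
    inner = parts[-1] if len(parts) > 1 else None
    return inner, package, compression
```

For a name like dump.trig.gz this yields ("trig", None, "gzip"), while for data.tar.gz the inner format comes back as None: the file name alone cannot say what is inside a package, which is why the metadata has to.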

Thanks for the summary. My use case is the first one. Here are two popular examples I make use of:

It's also common to serve a zip archive containing a single file, although this does not make much sense.

Anyway, I doubt that providers will change their web server settings just to make the DCAT ontology happy. One example of a dataset with multiple distributions from DNB:

https://data.dnb.de/opendata/GND.hdt.gz
https://data.dnb.de/opendata/GND.jsonld.gz
https://data.dnb.de/opendata/GND.ttl.gz
https://data.dnb.de/opendata/GND.rdf.gz

As I wrote in #54 (comment), the solution with +zip is just for the simple use case of a zipped-up distribution file. This solved an issue that was brought up in the work on the European DCAT-AP.
A more general issue is the use of a second property to indicate the format of the file(s) within a compressed or packaged file, like the property adms:representationTechnique.
I think this approach meets the requirements, or do people see a need to have more than two levels?

@makxdekkers Let's look at examples of dcat:Distributions for each case.

Note that neither the File Types codelist mandatory in DCAT-AP nor the official IANA Media Types list are exhaustive, therefore we need to use both.

The simplest case is an uncompressed CSV file (which is actually served with HTTP gzip compression when supported - transparent to DCAT). There is a CSV on the Web JSON descriptor of the CSV file in 2007.json:

@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix dct: <http://purl.org/dc/terms/> .

<https://data.gov.cz/zdroj/datová-sada/247025684/22> a dcat:Distribution ;
    dcat:accessURL <https://mvcr1.opendata.cz/czechpoint/2007.csv> ;
    dcat:downloadURL <https://mvcr1.opendata.cz/czechpoint/2007.csv> ;
    dct:license <https://data.gov.cz/podmínky-užití/volný-přístup/> ;

    dct:conformsTo <https://mvcr1.opendata.cz/czechpoint/2007.json> ;
    dct:format <http://publications.europa.eu/resource/authority/file-type/CSV> ;
    dcat:mediaType <http://www.iana.org/assignments/media-types/text/csv> .

Now let's add the explicit .gz compression of the CSV file and let's assume I use adms:representationTechnique for the inner type:

@prefix adms: <http://www.w3.org/ns/adms#> .
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix dct: <http://purl.org/dc/terms/> .

<https://data.gov.cz/zdroj/datová-sada/247025684/22> a dcat:Distribution ;
    dcat:accessURL <https://mvcr1.opendata.cz/czechpoint/2007.csv.gz> ;
    dcat:downloadURL <https://mvcr1.opendata.cz/czechpoint/2007.csv.gz> ;
    dct:license <https://data.gov.cz/podmínky-užití/volný-přístup/> ;

    dct:conformsTo <https://mvcr1.opendata.cz/czechpoint/2007.json> ;
    dct:format <http://publications.europa.eu/resource/authority/file-type/GZIP> ;
    dcat:mediaType <http://www.iana.org/assignments/media-types/application/gzip> ;
    adms:representationTechnique <http://www.iana.org/assignments/media-types/text/csv> .

This has several drawbacks:

  1. There is no way to specify the original http://publications.europa.eu/resource/authority/file-type/CSV file type (and media type), so people searching for CSV files will not find this distribution.
  2. The fact that the distribution is CSV is far more interesting than the fact that it is a GZIP file. I wonder if dct:format and dcat:mediaType should reflect the inner file and rather the compression technique should be specified in adms:representationTechnique so that people searching for CSV files would only need to check one property (dcat:mediaType), not two. This is also related to the next point.
  3. The dcterms:conformsTo specifies the JSON descriptor of the inner CSV file, not the gzip file. This supports the point that the whole distribution description should be focused on the inner file, and the compression should be indicated on top of that.

I would therefore suggest the following (the new property names are placeholders and can be replaced if more appropriate ones are found):

@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix dct: <http://purl.org/dc/terms/> .

<https://data.gov.cz/zdroj/datová-sada/247025684/22> a dcat:Distribution ;
    dcat:accessURL <https://mvcr1.opendata.cz/czechpoint/2007.csv.gz> ;
    dcat:downloadURL <https://mvcr1.opendata.cz/czechpoint/2007.csv.gz> ;
    dct:license <https://data.gov.cz/podmínky-užití/volný-přístup/> ;

    dct:conformsTo <https://mvcr1.opendata.cz/czechpoint/2007.json> ;
    dct:format <http://publications.europa.eu/resource/authority/file-type/CSV> ;
    dcat:mediaType <http://www.iana.org/assignments/media-types/text/csv> ;

    dcat:compressionMediaType <http://www.iana.org/assignments/media-types/application/gzip> ;
    dcat:compressionFormat <http://publications.europa.eu/resource/authority/file-type/GZIP> .
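
The practical difference for a consumer searching for CSV data can be sketched in Python. Distributions are modelled here as plain dicts keyed by the property names used in this thread; dcat:compressionMediaType is the proposed (not yet existing) property, and the helper names are hypothetical:

```python
CSV = "http://www.iana.org/assignments/media-types/text/csv"
GZIP = "http://www.iana.org/assignments/media-types/application/gzip"

# Outer-format-first modelling: dcat:mediaType describes the .gz file.
outer_first = {
    "dcat:mediaType": GZIP,
    "adms:representationTechnique": CSV,
}

# Inner-format-first modelling: dcat:mediaType describes the CSV inside.
# dcat:compressionMediaType is a proposed property, assumed for illustration.
inner_first = {
    "dcat:mediaType": CSV,
    "dcat:compressionMediaType": GZIP,
}


def is_csv_one_property(dist):
    """Search that only looks at dcat:mediaType."""
    return dist.get("dcat:mediaType") == CSV


def is_csv_two_properties(dist):
    """Search that must also look at adms:representationTechnique."""
    candidates = (
        dist.get("dcat:mediaType"),
        dist.get("adms:representationTechnique"),
    )
    return CSV in candidates
```

With the outer-first modelling the simple one-property search misses the distribution, and every consumer has to know about the second property. With the inner-first modelling the simple search is enough.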

Next, the packaging of multiple files. Let's assume that we have a TAR package with a set of homogeneous CSV files inside (e.g. data for individual years). Note that ZIP can be used here as well, as a packager rather than for compression:

@prefix adms: <http://www.w3.org/ns/adms#> .
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix dct: <http://purl.org/dc/terms/> .

<https://data.gov.cz/zdroj/datová-sada/247025684/22> a dcat:Distribution ;
    dcat:accessURL <https://mvcr1.opendata.cz/czechpoint/data.tar> ;
    dcat:downloadURL <https://mvcr1.opendata.cz/czechpoint/data.tar> ;
    dct:conformsTo <https://mvcr1.opendata.cz/czechpoint/2007.json> ;
    dct:license <https://data.gov.cz/podmínky-užití/volný-přístup/> ;

    dct:format <http://publications.europa.eu/resource/authority/file-type/TAR> ;
    # There is no IANA dcat:mediaType for TAR
    adms:representationTechnique <http://www.iana.org/assignments/media-types/text/csv> .

The same points as with the gzip compression above apply here. In addition:

  1. There is no indication that there are multiple files in the package. This could be solved by introducing separate properties for packaging technique and for compression technique. The use of the packaging property would indicate there are multiple files inside.

Therefore, I would suggest:

@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix dct: <http://purl.org/dc/terms/> .

<https://data.gov.cz/zdroj/datová-sada/247025684/22> a dcat:Distribution ;
    dcat:accessURL <https://mvcr1.opendata.cz/czechpoint/data.tar> ;
    dcat:downloadURL <https://mvcr1.opendata.cz/czechpoint/data.tar> ;
    dct:license <https://data.gov.cz/podmínky-užití/volný-přístup/> ;

    dct:conformsTo <https://mvcr1.opendata.cz/czechpoint/2007.json> ;
    dct:format <http://publications.europa.eu/resource/authority/file-type/CSV> ;
    dcat:mediaType <http://www.iana.org/assignments/media-types/text/csv> ;

# for TAR there is no media type, but e.g. for ZIP there is dcat:packageMediaType <http://www.iana.org/assignments/media-types/application/zip> ;
    dcat:packageFormat <http://publications.europa.eu/resource/authority/file-type/TAR> .

Finally, the packaging and compression case. This means multiple CSV files, and for instance TAR packaging and GZIP compression, or ZIP packaging and ZIP compression. Here we need to specify 3 levels - CSV, TAR and GZIP. So I would suggest:

@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix dct: <http://purl.org/dc/terms/> .

<https://data.gov.cz/zdroj/datová-sada/247025684/22> a dcat:Distribution ;
    dcat:accessURL <https://mvcr1.opendata.cz/czechpoint/data.tar.gz> ;
    dcat:downloadURL <https://mvcr1.opendata.cz/czechpoint/data.tar.gz> ;
    dct:conformsTo <https://mvcr1.opendata.cz/czechpoint/2007.json> ;
    dct:license <https://data.gov.cz/podmínky-užití/volný-přístup/> ;

    dct:format <http://publications.europa.eu/resource/authority/file-type/CSV> ;
    dcat:mediaType <http://www.iana.org/assignments/media-types/text/csv> ;

    dcat:packageFormat <http://publications.europa.eu/resource/authority/file-type/TAR> ;
# for TAR there is no media type, but e.g. for ZIP there is dcat:packageMediaType <http://www.iana.org/assignments/media-types/application/zip> ;
    dcat:compressionMediaType <http://www.iana.org/assignments/media-types/application/gzip> ;
    dcat:compressionFormat <http://publications.europa.eu/resource/authority/file-type/GZIP> .

This gives the publishers the possibility to describe the distribution properly, and the original DCAT properties are still used for the most important format, which is the innermost one.

Of course, the dcat:compressionMediaType, dcat:compressionFormat, dcat:packageMediaType and dcat:packageFormat properties could actually be existing ones, if suitable ones are found.
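
As a rough illustration of how a publisher's tooling could emit such descriptions, here is a Python sketch that builds the Turtle body for one distribution from its three layers. The function name describe, its parameters and the example.org URLs in the test are hypothetical, and the package/compression property names are the proposed ones from this comment, not existing DCAT terms:

```python
FILE_TYPE = "http://publications.europa.eu/resource/authority/file-type/"
IANA = "http://www.iana.org/assignments/media-types/"


def describe(subject, url, fmt, media, package=None, compression=None):
    """Build Turtle for one distribution (prefix declarations not included).

    package and compression are (file_type_code, media_type_or_None)
    tuples, or None when that layer is absent.
    """
    pairs = [
        ("dcat:downloadURL", url),
        ("dct:format", FILE_TYPE + fmt),   # innermost format
        ("dcat:mediaType", IANA + media),  # innermost media type
    ]
    for prefix, layer in (("package", package), ("compression", compression)):
        if layer is None:
            continue
        code, layer_media = layer
        pairs.append((f"dcat:{prefix}Format", FILE_TYPE + code))
        if layer_media is not None:  # e.g. TAR has no registered media type
            pairs.append((f"dcat:{prefix}MediaType", IANA + layer_media))
    body = " ;\n".join(f"    {p} <{o}>" for p, o in pairs)
    return f"<{subject}> a dcat:Distribution ;\n{body} ."
```

Passing package=("TAR", None) and compression=("GZIP", "application/gzip") reproduces the shape of the .tar.gz example: a package format without a media type, plus both compression properties.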

Thanks @jakubklimek, the introduction of additional properties for compression and packaging is a good idea. Handling package formats with multiple files requires distinguishing more cases:

  1. package with files of same type (e.g. tar file with csv inside)
  2. package with multiple files of different type (unspecified, could be described in a README file inside the package)
  3. package with multiple files of different type as specified in a packaging standard that only defines the inner files but not how these files are packaged (e.g. https://frictionlessdata.io/specs/data-package/).

Thanks @jakubklimek for your thorough analysis.
One comment from my side, related to backward compatibility.
To avoid problems, it might be better not to use existing properties dct:format and dcat:mediaType but to create new properties like dcat:containedFormat and dcat:containedMediaType so that there is no confusion with how people have been using dct:format and dcat:mediaType.

@nichtich You bring up a point that led the DCAT-AP development group not to go deeper into the issue of compressed and packaged files, namely that one could imagine many complex cases that would require a lot of specific properties. If this analysis could lead to a limited number of additional properties, and a clear guideline on how to use them, that would help a lot of people.

@makxdekkers we don't have to cover all cases - @jakubklimek already summarized the most important ones. In short, a distribution file can have these independent properties:

  1. it can be compressed. If so, there is a compression format to reference
  2. it can be a package file. If so, there is a package format to reference

Formats can be compression formats (e.g. gzip), package formats (e.g. tar), both (e.g. zip) or neither (e.g. csv). Furthermore, one and only one of these three cases may apply:

  • a package file can follow a defined directory layout (Data Package, Twitter Archive etc.). If so, this standard should be referenced
  • a package file can contain a set of files with same format (e.g. csv). If so, this format should be referenced
  • a compressed non-package file can have a known format to be referenced

These three cases all refer to the internal format and they are disjoint, so I'd use the existing format properties.
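
These rules can be written down as a small Python sketch. The class name, field names and case labels are hypothetical, chosen only to mirror the wording above:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DistributionFile:
    compression: Optional[str] = None      # e.g. "gzip"
    package: Optional[str] = None          # e.g. "tar"
    layout_standard: Optional[str] = None  # e.g. "Data Package"
    inner_format: Optional[str] = None     # e.g. "csv"


def internal_format_case(f):
    """Return which of the three disjoint cases applies, or None."""
    if f.layout_standard and f.inner_format:
        raise ValueError("layout standard and single inner format are exclusive")
    if f.layout_standard:
        if not f.package:
            raise ValueError("a layout standard requires a package format")
        return "package-with-layout"
    if f.inner_format and f.package:
        return "homogeneous-package"
    if f.inner_format and f.compression:
        return "compressed-file"
    return None  # plain file, or nothing known about the inside
```

Compression and packaging stay independent fields, while the three internal-format cases are mutually exclusive, which is what allows a single pair of format properties to carry the internal format.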

To avoid problems, it might be better not to use existing properties dct:format and dcat:mediaType but to create new properties like dcat:containedFormat and dcat:containedMediaType so that there is no confusion with how people have been using dct:format and dcat:mediaType.

@makxdekkers I see your point regarding the backward compatibility. The downside is that the actual data representation format (e.g. CSV, XML, JSON, RDF) will be attached using different properties for compressed/packaged and uncompressed/unpackaged distributions like this:

Uncompressed:

@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix dct: <http://purl.org/dc/terms/> .

<https://data.gov.cz/zdroj/datová-sada/247025684/22> a dcat:Distribution ;
    dcat:accessURL <https://mvcr1.opendata.cz/czechpoint/2007.csv> ;
    dcat:downloadURL <https://mvcr1.opendata.cz/czechpoint/2007.csv> ;
    dct:conformsTo <https://mvcr1.opendata.cz/czechpoint/2007.json> ;
    dct:license <https://data.gov.cz/podmínky-užití/volný-přístup/> ;

    dct:format <http://publications.europa.eu/resource/authority/file-type/CSV> ;
    dcat:mediaType <http://www.iana.org/assignments/media-types/text/csv> .

Compressed:

@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix dct: <http://purl.org/dc/terms/> .

<https://data.gov.cz/zdroj/datová-sada/247025684/22> a dcat:Distribution ;
    dcat:accessURL <https://mvcr1.opendata.cz/czechpoint/2007.csv.gz> ;
    dcat:downloadURL <https://mvcr1.opendata.cz/czechpoint/2007.csv.gz> ;
    dct:conformsTo <https://mvcr1.opendata.cz/czechpoint/2007.json> ;
    dct:license <https://data.gov.cz/podmínky-užití/volný-přístup/> ;

    dcat:containedFormat <http://publications.europa.eu/resource/authority/file-type/CSV> ;
    dcat:containedMediaType <http://www.iana.org/assignments/media-types/text/csv> ;

    dcat:compressionMediaType <http://www.iana.org/assignments/media-types/application/gzip> ;
    dcat:compressionFormat <http://publications.europa.eu/resource/authority/file-type/GZIP> .

To be honest, I am not sure how publishers actually behaved when faced with this challenge in DCAT 2014, i.e. whether they specified dct:format as CSV or GZIP when describing compressed files. I think both approaches were used in this case, causing confusion, as CSV made more sense as it was more descriptive, while GZIP described more the actual data file published on the web. The original DCAT 2014 definitions did not provide any guidance regarding this:

  • dcat:mediaType: The media type of the distribution as defined by IANA.
  • dct:format: The file format of the distribution.

package with multiple files of different type (unspecified, could be described in a README file inside the package)

@nichtich I think this case is typical of a poorly designed Dataset. It should be handled by splitting the Dataset according to the individual formats used in the archive, so that each part can be described properly by DCAT (e.g. with a reference to the format used) using the other cases. Or do you have an example where this would actually be appropriate?

@jakubklimek scripsit:

Another way of doing this is saying that we leave compression to HTTP server and client (restriction to HTTP), the server can use gzip_static to serve the .trig.gz file from its file system and the client decompresses it transparently in the HTTP layer. This means the Distribution still points to the .trig file, the media type is application/trig, and it is completely opaque to the user.

And a third way is to configure the web server to look at both Accept and Accept-Encoding. If we for instance have

GET /datasets/ds1
Accept: application/ld+json
Accept-Encoding: gzip

the server would return https://example.org/datasets/ds1.jsonld.gz whereas a request for

GET /datasets/ds1
Accept: application/xml
Accept-Encoding: gzip

would return https://example.org/datasets/ds1.xml.gz
Cf. RFC 7231 §5.3.4
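
A minimal Python sketch of that selection logic, assuming a hypothetical variant table for /datasets/ds1. It deliberately ignores q-values and wildcards, so it is a simplification rather than a full implementation of HTTP content negotiation:

```python
# Hypothetical variant table for /datasets/ds1, mirroring the two requests above.
VARIANTS = {
    "application/ld+json": "ds1.jsonld",
    "application/xml": "ds1.xml",
}


def tokens(header):
    """Split a comma-separated header into lowercase tokens.

    Parameters such as q-values are dropped, not evaluated.
    """
    return [
        part.split(";")[0].strip().lower()
        for part in header.split(",")
        if part.strip()
    ]


def select_variant(accept, accept_encoding=""):
    """Pick the file to serve for the given Accept / Accept-Encoding values."""
    for media_type in tokens(accept):
        if media_type in VARIANTS:
            name = VARIANTS[media_type]
            if "gzip" in tokens(accept_encoding):
                return name + ".gz"
            return name
    return None  # a real server would answer 406 or fall back to a default
```

Here select_variant("application/ld+json", "gzip") picks ds1.jsonld.gz and select_variant("application/xml", "gzip") picks ds1.xml.gz, matching the two requests above.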

@larsgsvensson Yes. The question is how this would relate to DCAT.

In this case I would imagine a dataset with 2 distributions, one for xml.gz, one for jsonld.gz, each described as shown above. This is necessary because of the other distribution description properties such as a schema, which would be different for XML and for JSON-LD.

This leaves the question of how /datasets/ds1 would be described in DCAT (a Data Service?), and how it would then be used by applications such as data catalogs: they would need to discover the available options to show them to their users, and would probably need a JavaScript implementation of the content negotiation, as it cannot be done using a simple HTML href.

See discussion on this issue in minutes of DCAT meeting https://www.w3.org/2018/06/28-dxwgdcat-minutes.html#x08

Also note possible related issues #256 and #81

@dr-shorthair A note to your comment in the email summary of the issue:

If the content is simple then the “+zip” strategy on the media-type designator is OK

I disagree.

  1. Some media types already have an extension, e.g. application/ld+json and a media type cannot have 2 extensions
  2. The +zip media type extension indicates the ZIP technique (application/zip), which is only one of many
  3. It would be an extra place to look for information about compression.
  4. What is a simple content and what is a complex content?

This is a potential rabbit hole, too many layers is impractical

Sure, too many layers are impractical, but I was proposing a quite simple solution to common (not all) situations, i.e. compressed file, packaged homogeneous files, and their combination. This also covers a compressed file with a standardized directory structure such as a Data Package.

@arminhaller Regarding your point in the minutes:

What about a compressed file that contains ttl, n3 and rdf/xml files that are all equivalent

These should be 3 dcat:Distributions, e.g. one for .ttl.gz, one for .n3.gz and one for .rdf.gz.

@andrea-perego Regarding your point in the minutes:

for standard nested formats we don't need to do anything

We still need the proposed extension for the common situations.

if nesting is done in an arbitrary way, a readme file within the structure should be used

The primary focus should be on machine readability. Something non-standard should be used as a distribution only where no standard DCAT approach is applicable, and this should be documented in the dataset's description and documentation.

I like @jakubklimek's analysis. If the problem can be solved with some extra properties, I am all for it. Apart from allowing machine-processing, it is also very relevant to show to the human user what is inside the file as this would help someone to decide not to download a big ZIP file with something inside that the user can't process.