rdfjs/rdfxml-streaming-parser.js

URL encoded strings are decoded in IRIs

ekulno opened this issue · 14 comments

Hi, I have an RDF/XML file where an IRI contains the character sequence &#xA;, which is a URL encoding for newlines (\n). In the output of rdfxml-streaming-parser, this string is decoded, so that my IRI now contains \n instead. The same can be seen for other strings such as &gt; and &lt;. This is different from what N3 does for Turtle-family parsing. I'm not certain which approach would be correct.

const fs = require('fs');
const RdfXmlParser = require("rdfxml-streaming-parser").RdfXmlParser;
const N3 = require('n3');

fs.createReadStream('test.rdf')
  .pipe(new RdfXmlParser())
  .on('data', console.log)

fs.createReadStream('test.ttl')
  .pipe(new N3.StreamParser())
  .on('data', console.log)

input files:

<?xml version="1.0" encoding="utf-8" ?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:ns0="b:">

  <rdf:Description rdf:about="a:&#xA;">
    <ns0:b rdf:resource="c:c"/>
  </rdf:Description>

</rdf:RDF>

<a:&#xA;><b:b><c:c>.

output:

Quad {
  subject: NamedNode { value: 'a:\n' },
  predicate: NamedNode { value: 'b:b' },
  object: NamedNode { value: 'c:c' },
  graph: DefaultGraph { value: '' }
}
Quad {
  subject: NamedNode { id: 'a:&#xA;' },
  predicate: NamedNode { id: 'b:b' },
  object: NamedNode { id: 'c:c' },
  graph: DefaultGraph { id: '' }
}

Bounty

A bounty has been placed on this issue by:

Triply
€544

As this is standard XML encoding behaviour, this looks like intended behaviour to me.
I quickly checked with some other RDF/XML parsers, and these seem to be doing the same here.

If you want encoded characters in your parsed outputs, I would suggest double encoding these characters. I suspect existing serializers would do this automatically.
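To illustrate the double-encoding suggestion, here is a minimal sketch. The helper name escapeXmlAttribute is hypothetical and is not part of rdfxml-streaming-parser or any serializer mentioned in this thread; it only shows the escaping that would, presumably, let an XML parser hand back the literal text &#xA; rather than a newline.

```javascript
// Hypothetical helper (an assumption for illustration, not a library API):
// escape a string for use inside a double-quoted XML attribute value.
function escapeXmlAttribute(value) {
  return value
    .replace(/&/g, '&amp;') // must run first so the entities added below are not re-escaped
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;');
}

// The IRI from the report, which should keep the literal text "&#xA;".
const iri = 'a:&#xA;';
const attribute = `rdf:about="${escapeXmlAttribute(iri)}"`;

console.log(attribute); // rdf:about="a:&amp;#xA;"
```

An XML parser decodes &amp; to & and then stops, so the attribute value it reports should be a:&#xA; with no newline in it.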

@rubensworks I think you're correct. There is an RDF/XML test case where an ampersand (&) is encoded in the RDF/XML input file, and is decoded in the N-Triples output file: https://www.w3.org/2013/RDFXMLTests/amp-in-url/

However, this does not immediately solve our problem: IIUC there are valid RDF/XML files that do not encode valid RDF graphs. Specifically, an RDF/XML file is allowed to encode characters that violate the abstract syntax rules for RDF terms.

I've asked this at the appropriate W3C mailing list: https://lists.w3.org/Archives/Public/public-rdf-comments/2020Jul/0000.html

Hmm, your point on the unescaped newline makes me suspect that this may in fact be something a parser should check (and error on).
But let's await the response on the mailing list.

Btw, I have noticed in other specs (and their test suites) that IRI validation usually isn't checked very strictly, or even not at all.

@rubensworks I indeed believe that RDF parsers must also -- at least to some extent -- check for IRI validity, otherwise valid RDF serialization documents can encode invalid RDF graphs.

Also, several serialization formats require that parsers resolve relative IRIs, which is not possible without -- at least to some extent -- validating the IRI syntax. See https://lists.w3.org/Archives/Public/semantic-web/2018Mar/0016.html for a prior discussion of this.

IMO people who hold that IRI validation is not part of RDF parsing have the following problems:

  1. They must admit that valid RDF documents may encode invalid RDF graphs.
  2. They must somehow satisfy the requirement of relative IRI resolution for invalid IRIs.
  3. They must employ an IRI validator component between their RDF parser and RDF loading components. (In practice, I have never seen such an IRI validator component.)

From @cygri on the W3C mailing list:

I can't find any rationale for ignoring the character reference. And the referenced character is not allowed in an IRI. This would make the document not valid RDF/XML.

Ok, so validating IRIs and throwing an error on invalid ones seems like a good solution.
I'd immediately apply this same check for all my parsers.
Given the performance overhead, making this disableable is probably also a good idea.
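A rough sketch of what such a check with an opt-out could look like. This is not the logic that ended up in the parser; checkIri, toIri and the validate option are hypothetical names. It only rejects the characters that the Turtle and N-Triples IRIREF grammar excludes, which is far less than full RFC 3987 validation.

```javascript
// Characters excluded by the Turtle / N-Triples IRIREF production:
// U+0000 through U+0020, and < > " { } | ^ ` \
const FORBIDDEN_IN_IRI = /[\u0000-\u0020<>"{}|^`\\]/;

// Hypothetical helper: returns an Error describing the first forbidden
// character, or undefined if none is found.
function checkIri(iri) {
  const match = FORBIDDEN_IN_IRI.exec(iri);
  if (!match) {
    return undefined;
  }
  const code = match[0].codePointAt(0).toString(16).toUpperCase().padStart(4, '0');
  return new Error(`Invalid IRI: forbidden character U+${code} at index ${match.index}`);
}

// Hypothetical wrapper: validation is on by default and can be switched off
// by callers who prefer speed over strictness.
function toIri(value, { validate = true } = {}) {
  if (validate) {
    const error = checkIri(value);
    if (error) {
      throw error;
    }
  }
  return value;
}

try {
  toIri('a:\n'); // the decoded IRI from the original report
} catch (error) {
  console.log(error.message); // Invalid IRI: forbidden character U+000A at index 2
}
```

With { validate: false } the same call returns the string unchanged, which is roughly the "disableable" behaviour suggested above.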

Probably superfluous, but this is still an issue in version 2.1.0

Tpt commented

As discussed with @rubensworks, I will work on this issue via the Comunica Association (pending approval from Triply).

@Tpt Thanks! You certainly have Triply's approval :-)

Thanks to @Tpt's work in #64, v2.2.0 now implements the new validation logic.

@wouterbeek can you confirm on your end that this resolves this bounty?

Thanks for fixing this, @Tpt and @rubensworks! @Ysgorg, who originally reported this bug, has checked the fix.

@wouterbeek Thanks for checking!
I'll ask internally to initiate the invoicing process.