URL encoded strings are decoded in IRIs
ekulno opened this issue · 14 comments
Hi, I have an RDF/XML file where an IRI contains the character sequence &#xA;, which is a URL encoding for newlines (\n). In the output of rdfxml-streaming-parser, this string is decoded, so that my IRI now contains \n instead. The same can be seen for other strings such as &gt; and &lt;. This is different from what N3 does for Turtle-family parsing. I'm not certain which approach is correct.
const fs = require('fs');
const RdfXmlParser = require('rdfxml-streaming-parser').RdfXmlParser;
const N3 = require('n3');

// Parse the RDF/XML file and log each quad.
fs.createReadStream('test.rdf')
  .pipe(new RdfXmlParser())
  .on('data', console.log);

// Parse the corresponding Turtle file with N3 for comparison.
fs.createReadStream('test.ttl')
  .pipe(new N3.StreamParser())
  .on('data', console.log);
input files:
<?xml version="1.0" encoding="utf-8" ?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:ns0="b:">
<rdf:Description rdf:about="a:&#xA;">
<ns0:b rdf:resource="c:c"/>
</rdf:Description>
</rdf:RDF>
<a:&#xA;><b:b><c:c>.
output:
Quad {
subject: NamedNode { value: 'a:\n' },
predicate: NamedNode { value: 'b:b' },
object: NamedNode { value: 'c:c' },
graph: DefaultGraph { value: '' }
}
Quad {
subject: NamedNode { id: 'a:&#xA;' },
predicate: NamedNode { id: 'b:b' },
object: NamedNode { id: 'c:c' },
graph: DefaultGraph { id: '' }
}
Bounty
A bounty has been placed on this issue by:
€544
As this is standard XML encoding behaviour, this looks like intended behaviour to me.
I quickly checked with some other RDF/XML parsers, and these seem to be doing the same here.
If you want encoded characters in your parsed output, I would suggest double-encoding these characters. I suspect existing serializers would do this automatically.
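To illustrate the double-encoding suggestion, here is a minimal sketch of how XML references get decoded in a single pass. The function name decodeXmlReferences is made up for this example and is not part of rdfxml-streaming-parser; a real XML parser handles many more cases (DTD-defined entities, attribute value normalization, and so on).

```javascript
// Hypothetical helper, for illustration only: decodes numeric character
// references and the five predefined XML entities in a single pass.
const PREDEFINED = { amp: '&', lt: '<', gt: '>', quot: '"', apos: "'" };

function decodeXmlReferences(text) {
  // String.prototype.replace does not rescan replaced text, so the output
  // of one replacement is never decoded a second time.
  return text.replace(/&(#x[0-9a-fA-F]+|#[0-9]+|[a-zA-Z]+);/g, (match, ref) => {
    if (ref.startsWith('#x')) {
      return String.fromCodePoint(parseInt(ref.slice(2), 16));
    }
    if (ref.startsWith('#')) {
      return String.fromCodePoint(parseInt(ref.slice(1), 10));
    }
    // Unknown entity names are left untouched in this sketch.
    return Object.prototype.hasOwnProperty.call(PREDEFINED, ref) ? PREDEFINED[ref] : match;
  });
}

// Single encoding: the reference becomes a real newline.
console.log(JSON.stringify(decodeXmlReferences('a:&#xA;')));     // "a:\n"
// Double encoding: only the ampersand is decoded, the rest stays literal.
console.log(JSON.stringify(decodeXmlReferences('a:&amp;#xA;'))); // "a:&#xA;"
```

So writing &amp;#xA; in the RDF/XML source should yield the literal five-character sequence in the parsed IRI, assuming the parser decodes references once, as XML requires.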
@rubensworks I think you're correct. There is an RDF/XML test case where an ampersand (&) is encoded in the RDF/XML input file, and is decoded in the N-Triples output file: https://www.w3.org/2013/RDFXMLTests/amp-in-url/
However, this does not immediately solve our problem: IIUC there are valid RDF/XML files that do not encode valid RDF graphs. Specifically, an RDF/XML file is allowed to encode characters that violate the abstract syntax rules for RDF terms.
I've asked this at the appropriate W3C mailing list: https://lists.w3.org/Archives/Public/public-rdf-comments/2020Jul/0000.html
Hmm, your point on the unescaped newline makes me suspect that this may in fact be something a parser should check (and error on).
But let's await the response on the mailing list.
Btw, I have noticed in other specs (and their test suites) that IRI validation usually isn't checked very strictly, or even not at all.
@rubensworks I indeed believe that RDF parsers must also -- at least to some extent -- check for IRI validity, otherwise valid RDF serialization documents can encode invalid RDF graphs.
Also, several serialization formats require that parsers resolve relative IRIs, which is not possible without -- at least to some extent -- validating the IRI syntax. See https://lists.w3.org/Archives/Public/semantic-web/2018Mar/0016.html for a prior discussion of this.
IMO people who hold that IRI validation is not part of RDF parsing have the following problems:
- They must admit that valid RDF documents may encode invalid RDF graphs.
- They must somehow satisfy the requirement of relative IRI resolution for invalid IRIs.
- They must employ an IRI validator component between their RDF parser and RDF loading components. (In practice, I have never seen such an IRI validator component.)
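As a rough idea of what a minimal IRI check could look like, here is a sketch that rejects the characters excluded by the IRIREF production of the Turtle and N-Triples grammars (U+0000 through U+0020, and < > " { } | ^ ` \). The function name findInvalidIriCharacter is hypothetical, and this is only a partial check: the full RFC 3987 grammar is considerably stricter.

```javascript
// Characters excluded by the Turtle/N-Triples IRIREF production:
// U+0000 through U+0020, and < > " { } | ^ ` \
const FORBIDDEN_IN_IRI = /[\u0000-\u0020<>"{}|^`\\]/;

// Hypothetical helper: returns the first forbidden character and its
// position, or null when none is found. It does not check the scheme,
// percent-encoding, or any other part of the RFC 3987 grammar.
function findInvalidIriCharacter(iri) {
  const match = FORBIDDEN_IN_IRI.exec(iri);
  return match ? { character: match[0], index: match.index } : null;
}

console.log(findInvalidIriCharacter('a:\n')); // reports the newline at index 2
console.log(findInvalidIriCharacter('b:b'));  // null
```

With a check like this, the decoded subject from the RDF/XML example (a: followed by a newline) would be flagged, while b:b and c:c would pass.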
From @cygri on the W3C mailing list:
I can't find any rationale for ignoring the character reference. And the referenced character is not allowed in an IRI. This would make the document not valid RDF/XML.
Ok, so validating IRIs and throwing an error on invalid ones seems like a good solution.
I'd immediately apply this same check to all my parsers.
Given the performance overhead, making this disableable is probably also a good idea.
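A sketch of what "validate by default, but allow opting out" could look like as a separate stream stage. The class name IriValidator and its validate option are assumptions made for this illustration and are not the parser's actual API; the character check is the same partial one based on the Turtle IRIREF production, not full RFC 3987 validation.

```javascript
const { Transform } = require('stream');

// Partial check: characters excluded by the Turtle/N-Triples IRIREF production.
const FORBIDDEN = /[\u0000-\u0020<>"{}|^`\\]/;

// Hypothetical object-mode stream that passes quads through and emits an
// error on the first named node containing a forbidden character.
class IriValidator extends Transform {
  constructor({ validate = true } = {}) {
    super({ objectMode: true });
    this.validate = validate; // set to false to skip the check entirely
  }

  _transform(quad, _encoding, callback) {
    if (this.validate) {
      for (const term of [quad.subject, quad.predicate, quad.object, quad.graph]) {
        if (term && term.termType === 'NamedNode' && FORBIDDEN.test(term.value)) {
          callback(new Error(`Invalid IRI: ${JSON.stringify(term.value)}`));
          return;
        }
      }
    }
    callback(null, quad);
  }
}
```

Such a stage could be piped after the parser, e.g. `.pipe(new RdfXmlParser()).pipe(new IriValidator())`, with `new IriValidator({ validate: false })` for users who prefer throughput over strictness. Doing the check inside the parser itself, as proposed above, avoids the extra stage.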
As discussed with @rubensworks, I will work on this issue via the Comunica Association.
Probably superfluous, but this is still an issue in version 2.1.0
As discussed with @rubensworks, I will work on this issue via the Comunica Association (pending approval from Triply).
@Tpt Thanks! You certainly have Triply's approval :-)
Thanks to @Tpt's work in #64, v2.2.0 now implements the new validation logic.
@wouterbeek can you confirm on your end that this resolves this bounty?
Thanks for fixing this, @Tpt and @rubensworks! @Ysgorg, who originally reported this bug, has checked the fix.
@wouterbeek Thanks for checking!
I'll ask internally to initiate the invoicing process.