rdfjs/rdfxml-streaming-parser.js

Distinct blank nodes are inadvertently merged

wouterbeek opened this issue · 9 comments

When the following RDF/XML snippet is parsed, this results in one creator with two labels. IIUC, this is a bug, since the result of parsing this snippet should consist of two creators with distinct blank node subject terms.

<?xml version="1.0" encoding="utf-8"?>
<rdf:RDF
    xmlns:dct="http://purl.org/dc/terms/"
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#">
  <rdf:Description rdf:about="https://example.com/">
    <dct:creator>
      <rdf:Description>
        <rdfs:label>ABC</rdfs:label>
      </rdf:Description>
    </dct:creator>
    <dct:creator>
      <rdf:Description>
        <rdfs:label>XYZ</rdfs:label>
      </rdf:Description>
    </dct:creator>
  </rdf:Description>
</rdf:RDF>

I have tested this with Rapper 2.0.14, which returns the following (output format Turtle):

<https://example.com/>
    dct:creator [
        rdfs:label "ABC"
    ], [
        rdfs:label "XYZ"
    ] .

Thanks for reporting, I'll look into it as soon as possible!

@rubensworks Thanks! Now that I am aware of this bug, I can detect it in real-world datasets such as the following:

| Dataset | Number of triples with Rapper | Number of triples with rdfxml-streaming-parser |
| --- | --- | --- |
| https://www.w3.org/2007/ont/unit# | 172 | 133 |
| http://purl.org/ontology/mo/ | 2,141 | 1,858 |

Hmm, that's really bad. I'm surprised this hasn't come up sooner, I would've expected the test suite to cover this.

> I'm surprised this hasn't come up sooner, I would've expected the test suite to cover this.

@rubensworks You're only 100 commits into supporting a very difficult data serialization format. (For reference, Raptor has 7K+ commits: https://github.com/dajobe/raptor) I think you're doing great.

I just tested this myself, and as far as I can see, things seem to be working properly.
Rdfxml-streaming-parser outputs the following when I test it locally:

<https://example.com/> <http://purl.org/dc/terms/creator> _:b1.
_:b1 <http://www.w3.org/2000/01/rdf-schema#label> "ABC".
<https://example.com/> <http://purl.org/dc/terms/creator> _:b2.
_:b2 <http://www.w3.org/2000/01/rdf-schema#label> "XYZ".

This should be equivalent to the output of Raptor.
(Added a unit test to confirm this: 3479a47)
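For readers who want to check the equivalence claim mechanically, here is a minimal sketch that groups the labels in the N-Triples output above by creator node. It uses a naive regular expression that is only meant to handle these four lines; it is not a general N-Triples parser, and the helper name `labelsPerCreator` is hypothetical, not part of rdfxml-streaming-parser.

```typescript
// N-Triples lines as reported in the thread (output of rdfxml-streaming-parser).
const reportedOutput: string = `
<https://example.com/> <http://purl.org/dc/terms/creator> _:b1.
_:b1 <http://www.w3.org/2000/01/rdf-schema#label> "ABC".
<https://example.com/> <http://purl.org/dc/terms/creator> _:b2.
_:b2 <http://www.w3.org/2000/01/rdf-schema#label> "XYZ".
`;

const CREATOR = 'http://purl.org/dc/terms/creator';
const LABEL = 'http://www.w3.org/2000/01/rdf-schema#label';

// Naive pattern: subject, predicate IRI, object, trailing dot.
// Assumption: sufficient for the lines above, not for arbitrary N-Triples.
const TRIPLE = /^(\S+)\s+<([^>]+)>\s+(.+?)\s*\.$/;

// Hypothetical helper: maps each dct:creator object to its rdfs:label values.
function labelsPerCreator(ntriples: string): Record<string, string[]> {
  const creators: Record<string, string[]> = {};
  const labels: Array<[string, string]> = [];
  for (const line of ntriples.split('\n')) {
    const match = TRIPLE.exec(line.trim());
    if (match === null) continue; // skip blank lines
    const subject = match[1];
    const predicate = match[2];
    const objectTerm = match[3];
    if (predicate === CREATOR) creators[objectTerm] = [];
    if (predicate === LABEL) labels.push([subject, objectTerm]);
  }
  // Attach labels afterwards so that triple order does not matter.
  for (const [subject, label] of labels) {
    if (subject in creators) creators[subject].push(label);
  }
  return creators;
}

const creatorLabels = labelsPerCreator(reportedOutput);
for (const creator of Object.keys(creatorLabels)) {
  console.log(creator, creatorLabels[creator].join(', '));
}
```

A merged blank node would show up here as a single key carrying both labels; the output above yields two keys with one label each.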

@wouterbeek Is it possible that something may have gone wrong in your tests? Perhaps you are using an outdated version of rdfxml-streaming-parser?

I'll also check the difference in triple counts in a bit.

Regarding the different triple counts:
This seems to be caused by the fact that some (invalid) triples are ignored when no baseIRI is set. If I parse those documents with a proper baseIRI, then the triple counts match those from Raptor.
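The thread does not say which triples are dropped. One plausible reading, stated here as an assumption, is that they involve relative IRI references, which cannot be turned into absolute IRIs without a base. The sketch below illustrates that general principle with the standard WHATWG `URL` class; it does not reproduce the parser's internals, WHATWG URL resolution is not identical to RFC 3987 IRI resolution, and both `resolveIri` and the `#Unit` reference are hypothetical.

```typescript
// Hypothetical helper: resolve an IRI reference against an optional base.
// Returns null when the reference cannot be made absolute.
function resolveIri(reference: string, baseIRI?: string): string | null {
  try {
    // The URL constructor throws a TypeError for a relative reference
    // when no base is supplied.
    return new URL(reference, baseIRI).href;
  } catch {
    return null;
  }
}

// Base taken from the first dataset URL in the thread (fragment marker dropped).
const unitBase = 'https://www.w3.org/2007/ont/unit';

console.log(resolveIri('#Unit'));                // null: no base to resolve against
console.log(resolveIri('#Unit', unitBase));      // resolved against the base
console.log(resolveIri('https://example.com/')); // already absolute, no base needed
```

Under this reading, passing the document's URL as the `baseIRI` mentioned above gives the parser what it needs to resolve such references, which would account for the triple counts lining up with Raptor's.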

Thanks for looking into this. I cannot verify this ATM, but I will test it on my side soon and report back.

Sorry, this was a mistake on our side :-(

No worries :-)