Distinct blank nodes are inadvertently merged
wouterbeek opened this issue · 9 comments
When the following RDF/XML snippet is parsed, this results in one creator with two labels. IIUC, this is a bug, since the result of parsing this snippet should consist of two creators with distinct blank node subject terms.
```xml
<?xml version="1.0" encoding="utf-8"?>
<rdf:RDF
    xmlns:dct="http://purl.org/dc/terms/"
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#">
  <rdf:Description rdf:about="https://example.com/">
    <dct:creator>
      <rdf:Description>
        <rdfs:label>ABC</rdfs:label>
      </rdf:Description>
    </dct:creator>
    <dct:creator>
      <rdf:Description>
        <rdfs:label>XYZ</rdfs:label>
      </rdf:Description>
    </dct:creator>
  </rdf:Description>
</rdf:RDF>
```
I have tested this with Rapper 2.0.14, which returns the following (output format Turtle):
```turtle
<https://example.com/>
    dct:creator [
        rdfs:label "George Fazekas"
    ], [
        rdfs:label "Simon Reinhardt"
    ] .
```
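To make the expectation concrete, here is a minimal sketch using only the Python standard library. It is not a conforming RDF/XML parser and is not how rdfxml-streaming-parser or Raptor work internally; it only handles the constructs in the snippet above (nested `rdf:Description` elements and plain literal properties). The function name `parse` and the `_:bN` labelling scheme are assumptions made for illustration. The point it shows is that every node element without `rdf:about` should get its own fresh blank node.

```python
import itertools
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

SNIPPET = """<?xml version="1.0" encoding="utf-8"?>
<rdf:RDF
    xmlns:dct="http://purl.org/dc/terms/"
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#">
  <rdf:Description rdf:about="https://example.com/">
    <dct:creator>
      <rdf:Description>
        <rdfs:label>ABC</rdfs:label>
      </rdf:Description>
    </dct:creator>
    <dct:creator>
      <rdf:Description>
        <rdfs:label>XYZ</rdfs:label>
      </rdf:Description>
    </dct:creator>
  </rdf:Description>
</rdf:RDF>"""


def parse(xml_text):
    """Toy walk over node and property elements (hypothetical helper, snippet-only)."""
    counter = itertools.count(1)
    triples = []

    def node(elem):
        about = elem.get(f"{{{RDF}}}about")
        # A node element without rdf:about gets a fresh blank node every time.
        subject = f"<{about}>" if about is not None else f"_:b{next(counter)}"
        for prop in elem:
            # ElementTree tags look like "{namespace}local"; join them into one IRI.
            predicate = "<" + prop.tag[1:].replace("}", "") + ">"
            children = list(prop)
            if children:
                obj = node(children[0])  # nested node element
            else:
                obj = '"' + (prop.text or "") + '"'  # plain literal, no escaping
            triples.append((subject, predicate, obj))
        return subject

    for top in ET.fromstring(xml_text.encode("utf-8")):
        node(top)
    return triples


for triple in parse(SNIPPET):
    print(*triple, ".")
```

With this reading, the two `dct:creator` properties point at two different blank nodes, each carrying exactly one label.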
Thanks for reporting, I'll look into it as soon as possible!
@rubensworks Thanks! Now that I am aware of this bug, I can detect it in real-world datasets such as the following:
| Dataset | № triples with Rapper | № triples with rdfxml-streaming-parser |
|---|---|---|
| https://www.w3.org/2007/ont/unit# | 172 | 133 |
| http://purl.org/ontology/mo/ | 2,141 | 1,858 |
Hmm, that's really bad. I'm surprised this hasn't come up sooner, I would've expected the test suite to cover this.
> I'm surprised this hasn't come up sooner, I would've expected the test suite to cover this.
@rubensworks You're only 100 commits into supporting a very difficult data serialization format. (For reference, Raptor has 7K+ commits: https://github.com/dajobe/raptor) I think you're doing great.
I just tested this myself, and as far as I can see, things seem to be working properly.
Rdfxml-streaming-parser outputs the following when I test it locally:
```turtle
<https://example.com/> <http://purl.org/dc/terms/creator> _:b1.
_:b1 <http://www.w3.org/2000/01/rdf-schema#label> "ABC".
<https://example.com/> <http://purl.org/dc/terms/creator> _:b2.
_:b2 <http://www.w3.org/2000/01/rdf-schema#label> "XYZ".
```
This should be equivalent to the output of Raptor.
(Added a unit test to confirm this: 3479a47)
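Since blank node labels are arbitrary, "equivalent" here means equal up to a renaming of blank nodes. Below is a rough sketch of such a check in plain Python. It is a brute-force comparison that is only practical for tiny graphs, and it is not the code from the unit test mentioned above. The function `isomorphic`, the labels `_:x` and `_:y`, and the `merged` graph are assumptions made for illustration.

```python
from itertools import permutations

EX = "<https://example.com/>"
CREATOR = "<http://purl.org/dc/terms/creator>"
LABEL = "<http://www.w3.org/2000/01/rdf-schema#label>"


def is_bnode(term):
    return term.startswith("_:")


def isomorphic(g1, g2):
    """Brute force: try every bijection between the blank node labels of g1 and g2."""
    g1, g2 = set(g1), set(g2)
    b1 = sorted({t for triple in g1 for t in triple if is_bnode(t)})
    b2 = sorted({t for triple in g2 for t in triple if is_bnode(t)})
    if len(g1) != len(g2) or len(b1) != len(b2):
        return False
    for perm in permutations(b2):
        mapping = dict(zip(b1, perm))
        renamed = {tuple(mapping.get(t, t) for t in triple) for triple in g1}
        if renamed == g2:
            return True
    return False


# The four triples reported above.
parser_output = {
    (EX, CREATOR, "_:b1"), ("_:b1", LABEL, '"ABC"'),
    (EX, CREATOR, "_:b2"), ("_:b2", LABEL, '"XYZ"'),
}
# Same graph with different (hypothetical) blank node labels.
relabelled = {
    (EX, CREATOR, "_:x"), ("_:x", LABEL, '"XYZ"'),
    (EX, CREATOR, "_:y"), ("_:y", LABEL, '"ABC"'),
}
# What the bug report described: one creator carrying both labels.
merged = {
    (EX, CREATOR, "_:b1"), ("_:b1", LABEL, '"ABC"'), ("_:b1", LABEL, '"XYZ"'),
}

print(isomorphic(parser_output, relabelled))
print(isomorphic(parser_output, merged))
```

The factorial cost of trying every permutation is acceptable for a two-node example like this one; real-world comparisons would need a proper graph canonicalization or isomorphism algorithm.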
@wouterbeek Is it possible that something may have gone wrong in your tests? Perhaps you are using an outdated version of rdfxml-streaming-parser?
I'll also check the difference in triple counts in a bit.
Regarding the different triple counts:
This seems to be caused by the fact that some (invalid) triples are ignored when no `baseIRI` is set. If I parse those documents with a proper `baseIRI`, then the triple counts match those from Raptor.
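As a rough illustration of why a missing base IRI can cost triples: a relative reference has no absolute form unless there is something to resolve it against. The sketch below uses only Python's `urllib.parse`. The helper `resolve` and the reference `#Thing` are hypothetical, and this is not the parser's actual code path; which triples are dropped in practice is only inferred from the explanation above.

```python
from urllib.parse import urljoin, urlsplit


def resolve(reference, base_iri=None):
    """Return an absolute IRI, or None if the reference is relative and no base is known."""
    if urlsplit(reference).scheme:
        return reference  # already absolute, no base needed
    if base_iri is None:
        return None  # nothing to resolve against, so no valid term can be built
    return urljoin(base_iri, reference)


# "#Thing" is a made-up relative reference, not a term taken from the datasets above.
print(resolve("#Thing"))
print(resolve("#Thing", "http://purl.org/ontology/mo/"))
```

Any triple whose subject or object depends on such an unresolved reference cannot be emitted as valid RDF, which would account for lower counts when the base IRI is left out.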
Thanks for looking into this. I cannot verify this ATM, but I will test this on my side soon and report back.
Sorry, this was a mistake on our side :-(
No worries :-)