XML predicate mapping repeating child elements getting concatenated if reference includes concatenation
schivmeister opened this issue · 3 comments
Environment
rmlmapper v6.5.1 (reproducible also as far back as v6.1.3)
Linux/WSL2
Java 17, 11
Namespaces
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix rr: <http://www.w3.org/ns/r2rml#> .
@prefix rml: <http://semweb.mmlab.be/ns/rml#> .
@prefix ql: <http://semweb.mmlab.be/ns/ql#> .
@prefix ex: <http://data.example.org/resource/> .
@prefix org: <http://www.w3.org/ns/org#> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix fnml: <http://semweb.mmlab.be/ns/fnml#> .
@prefix fno: <https://w3id.org/function/ontology#> .
@prefix idlab-fn: <http://example.com/idlab/function/> .
Problem
Given the following kind of input XML with two Organization elements, where the first has two child Name elements:
<Directory>
  <Organization>
    <ID>123</ID>
    <Name>ABC Fast Company</Name>
    <Name>ABC FastCo</Name>
  </Organization>
  <Organization>
    <ID>456</ID>
    <Name>XYZ Inc.</Name>
  </Organization>
</Directory>
and the following kind of RML mapping involving a custom concatenated value in the source reference:
ex:Organizations a rr:TriplesMap;
  rml:logicalSource [
    rml:source "test.xml";
    rml:iterator "/Directory/Organization";
    rml:referenceFormulation ql:XPath
  ];
  rr:subjectMap [
    rr:template "http://data.example.org/resource/Organization_{ID}";
    rr:class org:Organization
  ];
  rr:predicateObjectMap [
    rr:predicate org:name;
    rr:objectMap [
      rml:reference "'CustomPrefix ' || Name || ' CustomSuffix'"
    ];
  ]
.
Actual
The first resource's name comes out as a single value, with the repeating values concatenated between the prefix and suffix, instead of multiple comma-separated RDF/Turtle values:
ex:Organization_123 a org:Organization;
org:name "CustomPrefix ABC Fast CompanyABC FastCo CustomSuffix" . # these are values from two Name elements
# the second resource remains unaffected (correctly formed)
ex:Organization_456 a org:Organization;
org:name "CustomPrefix XYZ Inc. CustomSuffix" .
Expected
Should result in multiple comma-separated values mapped from the XML child elements, adhering to the condition of the reference:
ex:Organization_123 a org:Organization;
org:name "CustomPrefix ABC Fast Company CustomSuffix", "CustomPrefix ABC FastCo CustomSuffix" .
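To make the expected semantics concrete, here is a minimal Python sketch (standard library only) of "apply the concatenation once per Name value". It is not RMLMapper code; the function name `expected_names` and the inlined copy of the sample XML are purely illustrative assumptions.

```python
import xml.etree.ElementTree as ET

# Inlined copy of the sample input from the report.
XML = """<Directory>
  <Organization>
    <ID>123</ID>
    <Name>ABC Fast Company</Name>
    <Name>ABC FastCo</Name>
  </Organization>
  <Organization>
    <ID>456</ID>
    <Name>XYZ Inc.</Name>
  </Organization>
</Directory>"""


def expected_names(xml_text, prefix="CustomPrefix ", suffix=" CustomSuffix"):
    """Return {ID: [one wrapped value per <Name> element]} (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    result = {}
    # Each Organization plays the role of one record from rml:iterator.
    for org in root.findall("Organization"):
        org_id = org.findtext("ID")
        # Wrap every Name separately instead of wrapping the whole set once.
        result[org_id] = [prefix + n.text + suffix for n in org.findall("Name")]
    return result


if __name__ == "__main__":
    for org_id, names in expected_names(XML).items():
        print(org_id, names)
```

Each list entry corresponds to one object of org:name, so Organization 123 would get two values and Organization 456 one.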
Workaround
Template bypassing XPath expressions
This is perhaps the closest thing to an actual solution (if you don't need additional XPath complexity):
rr:objectMap
[
rr:template "CustomPrefix {Name} CustomSuffix" ;
rr:datatype xsd:string ; # an explicit type is required otherwise termType IRI is inferred and error raised
];
producing the correct result:
ex:Organization_123 a org:Organization;
org:name "CustomPrefix ABC Fast Company CustomSuffix", "CustomPrefix ABC FastCo CustomSuffix" .
ex:Organization_456 a org:Organization;
org:name "CustomPrefix XYZ Inc. CustomSuffix" .
Plain reference with out-of-band strategies
One could skip the concatenation in the reference altogether and use something external to replicate the desired outcome, e.g. (custom) functions, or even just a lookup in a mapping table via a parentTriplesMap.
Removing the concatenation obviously makes it work:
rr:objectMap
[
rml:reference "Name"
];
resulting in:
ex:Organization_123 a org:Organization;
org:name "ABC Fast Company", "ABC FastCo" .
Reoriented iterator
Using an iterator on the child element which repeats but creating the subject using the ancestor element appears to work:
ex:Organizations a rr:TriplesMap;
  rml:logicalSource [
    rml:source "test.xml";
    rml:iterator "/Directory/Organization/Name";
    rml:referenceFormulation ql:XPath
  ];
  rr:subjectMap [
    rr:template "http://data.example.org/resource/Organization_{../ID}";
    rr:class org:Organization
  ];
  rr:predicateObjectMap [
    rr:predicate org:name;
    rr:objectMap [
      rml:reference "'CustomPrefix ' || . || ' CustomSuffix'"
    ];
  ]
.
However, this is unintuitive and convoluted. The correct solution would be if repeating child elements were also repeated as values for a predicateObjectMap, as they normally are with a plain reference (or template).
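For readers unfamiliar with the trick, the reoriented iterator can be pictured with the following Python sketch (standard library only). It only mimics the idea of iterating over Name and reaching back to the parent for the ID; it is not how RMLMapper evaluates the mapping, and `triples_from_name_iterator` is a hypothetical name.

```python
import xml.etree.ElementTree as ET

# Inlined copy of the sample input from the report.
XML = """<Directory>
  <Organization>
    <ID>123</ID>
    <Name>ABC Fast Company</Name>
    <Name>ABC FastCo</Name>
  </Organization>
  <Organization>
    <ID>456</ID>
    <Name>XYZ Inc.</Name>
  </Organization>
</Directory>"""

BASE = "http://data.example.org/resource/Organization_"


def triples_from_name_iterator(xml_text):
    """Return (subject, name) pairs, one per <Name> element (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    # ElementTree has no parent pointers, so build a child -> parent lookup.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    pairs = []
    # Mirrors rml:iterator "/Directory/Organization/Name".
    for name in root.findall("Organization/Name"):
        org_id = parent_of[name].findtext("ID")  # mirrors "../ID" in the template
        # Each record holds exactly one Name, so "." is always single-valued.
        pairs.append((BASE + org_id, "CustomPrefix " + name.text + " CustomSuffix"))
    return pairs


if __name__ == "__main__":
    for subject, value in triples_from_name_iterator(XML):
        print(subject, value)
```

Because every record now contains a single Name, the concatenation never sees more than one value, which is presumably why this variant avoids the problem.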
MWE
rml-mwe-concat-multivalue.zip (excludes template example)
Context
Thanks for the very detailed bug report! I'm afraid this is an old RML spec issue: how to work with multi-valued references is underspecified (sometimes producing very weird results, as you've detailed here, e.g. in combination with rr:template or a function). We're working on improving the new version of the spec and on a more global solution using the Logical Views extension, with a PoC implementation available (and a paper being presented next month); however, that's all still in alpha stage.
So, there are actually 3 paths that can be taken in parallel, I think:
- you keep following the unintuitive solution, as that's probably most mature
- you try the logical views PoC as an experiment to see whether that solves this and more problems, and help us with feedback on the spec
- we double-check this bug report to see whether this is an edge case that should also be fixed in the core RML spec, and could thus result in a bug fix in the current RMLMapper-JAVA
We'll check when we can dedicate some time to this bug report, but as you can imagine, as an academic institution we are always trying to find a balance with our research roadmaps and paid projects. If this is really blocking you, feel free to reach out at info@rml.io to see how we can prioritize this!
Thank you for the swift response @bjdmeest! It already helps a lot to know that I'm not (likely) making a mistake somewhere. I understand that offering a resolution is not always possible, which is totally fine. We will reach out if it indeed turns out to be a blocker.
There are potentially other solutions depending on the use case, e.g. in our case it was originally related to a lookup based on modified source values, but we decided to encode certain values in the lookup table as a workaround instead, so that we need not modify the reference.
Otherwise, I took a look again at the Logical Views extension, which I did check out briefly once before for tabular lookups. However, I don't see XML as a supported source format in the reference/PoC implementation, and I also think it attacks a different problem.
Nevertheless, I took the liberty to try and figure out where in the code this is likely happening. It appears to be an issue with the dataio library's XMLRecord.get() implementation as called in ReferenceExtractor::extract(). When reproducing the issue, record.get() yields:
[CustomPrefix ABC Fast CompanyABC FastCo CustomSuffix]
instead of
[CustomPrefix ABC Fast Company CustomSuffix, CustomPrefix ABC FastCo CustomSuffix]
or in the case of a plain reference:
[ABC Fast Company, ABC FastCo]
It could very well be that the concatenation causes unexpected behaviour in the evaluation of the XPaths (using Saxon?), as a direct concat on repeating elements would otherwise raise an error of the form:
error: A sequence of more than one item is not allowed as the second argument of fn:concat() ...
Tested using:
java -cp saxon-he-12.4.jar net.sf.saxon.Query -s:test.xml -qs:"concat('CustomPrefix', /Directory/Organization[1]/Name, ' CustomSuffix')"
But we are not getting an error in the mapping itself, just unexpected concatenation, which indicates that the function works but is being evaluated on the entire set of XPath query results.
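The hypothesis can be illustrated with a small Python sketch (standard library only). This is not the dataio or RMLMapper code, only an assumed model in which all matched Name values are merged into one string before the prefix and suffix are applied; `concat_over_whole_set` is a hypothetical name.

```python
import xml.etree.ElementTree as ET

# First Organization from the sample input, inlined for a self-contained example.
ORG = ET.fromstring(
    "<Organization><ID>123</ID>"
    "<Name>ABC Fast Company</Name><Name>ABC FastCo</Name>"
    "</Organization>"
)


def concat_over_whole_set(org):
    """Model of the suspected behaviour: wrap the merged set once (hypothetical)."""
    # All Name values are joined first, without any separator ...
    joined = "".join(n.text for n in org.findall("Name"))
    # ... and only then wrapped, yielding a single value instead of two.
    return ["CustomPrefix " + joined + " CustomSuffix"]


if __name__ == "__main__":
    print(concat_over_whole_set(ORG))
```

Under this assumption the result is a one-element list matching the string seen from record.get(), which is consistent with the idea that the expression is applied to the whole result set rather than to each item.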
I realized after all that we also have template, which works:
rr:objectMap
[
rr:template "CustomPrefix {Name} CustomSuffix" ;
rr:datatype xsd:string ; # an explicit type is required otherwise termType IRI is inferred and error raised
];
So, this is a very valid alternative for simple cases not involving other XPath expressions (added as a workaround in the original post).