zazuko/rdf-validate-shacl

How to validate literals based on their datatype IRI?

wouterbeek opened this issue · 5 comments

I do not understand how literals should be validated based on their datatype IRI. I make the following observations:

  1. For some literals specifying the datatype IRI with sh:datatype seems to suffice in order to also check their lexical form. An example of this is xsd:boolean, where lexical form "-false" is currently not accepted because the minus sign is not part of the syntax for Boolean lexical forms.

  2. For some literals specifying the datatype IRI with sh:datatype does not seem sufficient, since incorrect lexical forms are still accepted. An example of this is xsd:double for which "--1.1e0" is accepted, even though the double occurrence of the hyphen is not supported by the floating-point syntax.

  3. At the same time, it is also not clear how regular expressions could be specified manually to compensate for the missing lexical form validation (see #44 for generic issues with the way in which regular expressions are currently supported). For example, specifying the regular expression sh:pattern "(\\+|-)?([0-9]+(\\.[0-9]*)?|\\.[0-9]+)", copied from the XSD standard, alongside sh:datatype xsd:double still validates literals like "--1.1e1"^^xsd:double as ok, even though they violate both the datatype IRI and the regular expression specifications.

At the moment it is difficult for me to determine what is intended behavior and what is a bug. It would be great if SHACL could be used to validate literals, but I am not sure whether (1) such validation is indeed intended by the SHACL standard, and whether (2) it is technologically feasible to implement such validation with contemporary technology.

Could you provide the above cases complete with shapes and sample data?

Also, please check with the SHACL playground to see what the results are there.

@tpluscode I have not done anything complicated yet. I think that even the most simple things like the XSD literals do not work. I can still share my files of course :-)

This is my data file:

prefix xsd: <http://www.w3.org/2001/XMLSchema#>
[ a <C>;
  <p> "-false"^^xsd:boolean; # This will not validate when `sh:datatype xsd:boolean` is used.
  <r> "--1.1e0"^^xsd:double ]. # This will validate when `sh:datatype xsd:double` is used.

And this is my patterns file:

prefix sh: <http://www.w3.org/ns/shacl#>
prefix xsd: <http://www.w3.org/2001/XMLSchema#>

[ sh:property
    [ sh:datatype xsd:boolean;
      sh:path <p> ],
    [ sh:datatype xsd:double;
      sh:path <r>;
      sh:pattern "(\\+|-)?([0-9]+(\\.[0-9]*)?|\\.[0-9]+)" ]; # This does not do anything at all IIUC.
  sh:targetClass <C> ].

I have added a couple more examples. This is mostly a copy/paste from the XSD standard. I have replaced backward slashes with double backward slashes, since this seems to be required. Since I do not know the Regex grammar, I do not know whether the Regexes are valid (the library does not give feedback when a Regex cannot be processed).
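As a rough way to get the feedback that is missing here, the patterns can be compiled outside the library before they go into the shapes file. The sketch below uses Python's `re` module, which is an assumption on my part: its dialect is close to, but not identical with, the regex dialect SHACL prescribes, so a pattern that compiles here is only probably fine. The helper name `check_pattern` is hypothetical.

```python
import re
from typing import Optional


def check_pattern(pattern: str) -> Optional[str]:
    """Return an error message if the pattern does not compile, else None."""
    try:
        re.compile(pattern)
    except re.error as exc:
        return str(exc)
    return None


# In Turtle, "\\." inside a quoted string denotes two characters: a backslash
# and a dot. A regular Python string literal unescapes the same way, so this
# literal holds exactly what the regex engine receives.
decimal_pattern = "(\\+|-)?([0-9]+(\\.[0-9]*)?|\\.[0-9]+)"

for candidate in (decimal_pattern, "(unclosed"):
    problem = check_pattern(candidate)
    print(candidate, "->", "ok" if problem is None else problem)
```

The doubled backslashes are therefore only a Turtle string escape; the pattern itself contains single backslashes.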

This is my patterns file:

prefix sh: <http://www.w3.org/ns/shacl#>
prefix xsd: <http://www.w3.org/2001/XMLSchema#>

[ sh:property
    [ sh:datatype xsd:boolean;
      sh:path <boolean>;
      sh:pattern "false|true|0|1" ],
    [ sh:datatype xsd:date;
      sh:path <date>;
      sh:pattern "-?([1-9][0-9]{3,}|0[0-9]{3})-(0[1-9]|1[0-2])-(0[1-9]|[12][0-9]|3[01])(Z|(\\+|-)((0[0-9]|1[0-3]):[0-5][0-9]|14:00))?" ],
    [ sh:datatype xsd:dateTime;
      sh:path <dateTime>;
      sh:pattern """
-?([1-9][0-9]{3,}|0[0-9]{3})
-(0[1-9]|1[0-2])
-(0[1-9]|[12][0-9]|3[01])
T(([01][0-9]|2[0-3]):[0-5][0-9]:[0-5][0-9](\\.[0-9]+)?|(24:00:00(\\.0+)?))
(Z|(\\+|-)((0[0-9]|1[0-3]):[0-5][0-9]|14:00))?""" ],
    [ sh:datatype xsd:decimal;
      sh:path <decimal>;
      sh:pattern "(\\+|-)?([0-9]+(\\.[0-9]*)?|\\.[0-9]+)" ],
    [ sh:datatype xsd:double;
      sh:path <double>;
      sh:pattern "(\\+|-)?([0-9]+(\\.[0-9]*)?|\\.[0-9]+)([Ee](\\+|-)?[0-9]+)? |(\\+|-)?INF|NaN" ],
    [ sh:datatype xsd:duration;
      sh:path <duration>;
      sh:pattern """
-?P( ( ( [0-9]+Y([0-9]+M)?([0-9]+D)?
       | ([0-9]+M)([0-9]+D)?
       | ([0-9]+D)
       )
       (T ( ([0-9]+H)([0-9]+M)?([0-9]+(\\.[0-9]+)?S)?
          | ([0-9]+M)([0-9]+(\\.[0-9]+)?S)?
          | ([0-9]+(\\.[0-9]+)?S)
          )
       )?
    )
  | (T ( ([0-9]+H)([0-9]+M)?([0-9]+(\\.[0-9]+)?S)?
       | ([0-9]+M)([0-9]+(\\.[0-9]+)?S)?
       | ([0-9]+(\\.[0-9]+)?S)
       )
    )
  )""" ],
    [ sh:datatype xsd:gMonth;
      sh:path <gMonth>;
      sh:pattern "--(0[1-9]|1[0-2])(Z|(\\+|-)((0[0-9]|1[0-3]):[0-5][0-9]|14:00))?" ],
    [ sh:datatype xsd:gYear;
      sh:path <gYear>;
      sh:pattern "-?([1-9][0-9]{3,}|0[0-9]{3})(Z|(\\+|-)((0[0-9]|1[0-3]):[0-5][0-9]|14:00))?" ],
    [ sh:datatype xsd:gYearMonth;
      sh:path <gYearMonth>;
      sh:pattern "-?([1-9][0-9]{3,}|0[0-9]{3})-(0[1-9]|1[0-2])(Z|(\\+|-)((0[0-9]|1[0-3]):[0-5][0-9]|14:00))?" ],
    [ sh:datatype xsd:string;
      sh:path <string>;
      sh:pattern "\\S" ],
    [ sh:datatype xsd:time;
      sh:path <time>;
      sh:pattern "(([01][0-9]|2[0-3]):[0-5][0-9]:[0-5][0-9](\\.[0-9]+)?|(24:00:00(\\.0+)?))(Z|(\\+|-)((0[0-9]|1[0-3]):[0-5][0-9]|14:00))?" ];
  sh:targetClass <C> ].
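One thing to watch in the triple-quoted patterns above: the line breaks and indentation are part of the string, so a regex engine treats them as literal characters unless it runs in a free-spacing mode. The sketch below illustrates this with Python's `re` module (an assumption; it stands in for whatever engine the library uses) on the date part of the dateTime pattern. Whether this library supports a free-spacing flag through sh:flags is not confirmed anywhere in this thread.

```python
import re

# Date part of the dateTime pattern, split across lines as in the shapes file.
MULTILINE = """
-?([1-9][0-9]{3,}|0[0-9]{3})
-(0[1-9]|1[0-2])
-(0[1-9]|[12][0-9]|3[01])"""

# Without free-spacing mode the newlines must appear in the input, so this fails.
literal = re.fullmatch(MULTILINE, "2020-01-31")

# Free-spacing mode (re.VERBOSE in Python) ignores the whitespace.
verbose = re.fullmatch(MULTILINE, "2020-01-31", re.VERBOSE)

# Engine-independent alternative: strip all whitespace before using the pattern.
joined = re.fullmatch("".join(MULTILINE.split()), "2020-01-31")

print(literal is None, verbose is not None, joined is not None)
```

Writing each pattern on a single line, as in the date and time shapes, avoids the question entirely.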

This is my data file:

prefix xsd: <http://www.w3.org/2001/XMLSchema#>
<i>
  a <C>;
  <boolean> false, "0"^^xsd:boolean;
  <date> "-1-01-01"^^xsd:date;
  <dateTime> "-1-01-01T00:00:00-00:00"^^xsd:dateTime;
  <decimal> -01.10, "-02.20"^^xsd:decimal;
  <double> -1.1e+0, "-2.2e+0"^^xsd:double;
  <duration> "-1-01-01T00:00:00-00:00"^^xsd:duration;
  <gMonth> "--01"^^xsd:gMonth;
  <gYear> "-1"^^xsd:gYear, "111111"^^xsd:gYear;
  <gYearMonth> "-1-01Z"^^xsd:gYear, "111111-01Z"^^xsd:gYear;
  <string> "😺", "😺"^^xsd:string;
  <time> "00:00:00-00:00"^^xsd:time.

Since Regex is a crude approach for validating lexical forms, it would be better if lexical forms could also be validated by specifying the datatype IRI (sh:datatype). If that is not feasible, then having proper Regex support would at least allow us to add sh:pattern triples based on the presence of sh:datatype triples.

After looking at your examples in the SHACL playground and the spec I have a few observations:

  1. Boolean is handled incorrectly: the library evaluates the truthiness of the literal, so 0 becomes false and pretty much anything else becomes true. We probably inherited that issue too.
  2. Did you get those regexes from W3C XML Schema? I think the whitespace is a problem in some of them. For example, the double expression has a space before the |(\\+|-)?INF|NaN alternatives. Remove that space and it will work.
  3. Otherwise you will need to add the start/end of line symbols ^ and $. Without them you risk matching only a portion of the literal.
  4. Strangely, decimal actually gets validated by the datatype constraint alone
  5. The regex created by the library probably needs a u flag to handle unicode correctly
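Points 2 and 3 can be reproduced outside the library. The following is a minimal sketch using Python's `re` module as a stand-in for the library's regex engine (an assumption); the patterns are the ones from the shapes files in this thread, and the helper `matches_whole` is hypothetical.

```python
import re

# Decimal pattern as the regex engine sees it after Turtle unescaping.
DECIMAL = r"(\+|-)?([0-9]+(\.[0-9]*)?|\.[0-9]+)"

# Unanchored: succeeds, because the substring "-1.1" matches.
unanchored = re.search(DECIMAL, "--1.1e1")

# Anchored with ^...$: the whole lexical form has to match, so this fails.
anchored = re.search(f"^({DECIMAL})$", "--1.1e1")

# Double pattern from the thread, including the stray space before "|".
DOUBLE_WITH_SPACE = (
    r"(\+|-)?([0-9]+(\.[0-9]*)?|\.[0-9]+)([Ee](\+|-)?[0-9]+)? |(\+|-)?INF|NaN"
)
# Same pattern with the space removed.
DOUBLE_FIXED = DOUBLE_WITH_SPACE.replace(" |", "|")


def matches_whole(pattern: str, lexical: str) -> bool:
    """True if the pattern matches the entire lexical form."""
    return re.search(f"^({pattern})$", lexical) is not None


print(unanchored.group(0))                      # the matched substring
print(anchored)                                 # None
print(re.search(DOUBLE_WITH_SPACE, "-2.2e+0"))  # None: a trailing space is required
print(matches_whole(DOUBLE_FIXED, "-2.2e+0"))
print(matches_whole(DOUBLE_FIXED, "--1.1e0"))
```

The extra group around the pattern in `matches_whole` matters: without it, ^ and $ would bind only to the first and last alternatives of a pattern such as the double one.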

Now, while the spec does not mention checking the lexical correctness of literals, it could be added as an option to the library. What do you think @martinmaillard ?

This library already uses rdf-validate-datatype to validate the lexical correctness of literals. So if something gets validated wrong, it's probably a bug.
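For readers unfamiliar with the idea, lexical validation keyed on the datatype IRI can be sketched as a lookup from IRI to a whole-string pattern. This is only an illustration in Python, not the API or implementation of rdf-validate-datatype; the table `LEXICAL_FORMS` and the function `is_valid_lexical_form` are hypothetical names, and the patterns are the ones quoted from the XSD standard earlier in this thread.

```python
import re
from typing import Optional

XSD = "http://www.w3.org/2001/XMLSchema#"

# Hypothetical lookup table: datatype IRI -> regex for the lexical form.
LEXICAL_FORMS = {
    XSD + "boolean": r"true|false|0|1",
    XSD + "decimal": r"(\+|-)?([0-9]+(\.[0-9]*)?|\.[0-9]+)",
    XSD + "double": (
        r"(\+|-)?([0-9]+(\.[0-9]*)?|\.[0-9]+)([Ee](\+|-)?[0-9]+)?"
        r"|(\+|-)?INF|NaN"
    ),
}


def is_valid_lexical_form(lexical: str, datatype: str) -> Optional[bool]:
    """Check a lexical form against its datatype; None means 'unknown datatype'."""
    pattern = LEXICAL_FORMS.get(datatype)
    if pattern is None:
        return None
    # fullmatch requires the entire string to match, so no ^...$ is needed.
    return re.fullmatch(pattern, lexical) is not None


for lexical, datatype in [
    ("-false", XSD + "boolean"),
    ("0", XSD + "boolean"),
    ("--1.1e0", XSD + "double"),
    ("-2.2e+0", XSD + "double"),
]:
    print(lexical, datatype, is_valid_lexical_form(lexical, datatype))
```

Under this scheme both "-false"^^xsd:boolean and "--1.1e0"^^xsd:double from the first data file would be rejected, which is the behavior the original question expects from sh:datatype.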