rdfjs/data-model-spec

`.datatype` should be a string

Closed this issue · 8 comments

Literal#datatype should be a string

In RDF, datatypes are URIs. They are never blank nodes. Using a Term is a redundant and inconvenient way to make the value of the IRI a little farther from reach. As opposed to a string, the Term also has a slight negative impact on memory and performance.

Consistency

You may think it's consistent to wrap .datatype as a Term, but at this tier in the class relation tree, it's actually inconsistent with the .value property of a NamedNode and all other Term properties which are all strings.

.equals

Aside from the ontology portion of a graph, datatype IRIs just about never coincide with the IRIs of subject or object nodes (i.e. NamedNodes).

  • As a developer, since the overwhelming majority of datatype comparisons are against other datatypes, the naive comparison is the most natural one: if(lit1.datatype === lit2.datatype). The cases where having datatype already be a Term comes in handy are obscure:

    // `subject` is a NamedNode whose IRI also happens to be used as a datatype
    if(lit1.datatype.equals(subject))
    • Rather, for those rare incidents, you can do:
    if(DataFactory.namedNode(lit1.datatype).equals(subject))
    // -- or, better yet --
    if(subject.value === lit1.datatype && isNamedNode(subject))
    • Again, these are obscure cases - usually to fetch some information about the datatype itself from the ontology.
  • The much more common application of datatypes is for testing whether or not a particular datatype is the right one.

    // right now, the implication is to use the interface method:
    if(lit1.datatype.equals(DataFactory.namedNode('http://www.opengis.net/ont/geosparql#')))
    // -- but that is messy, less elegant, and much slower performance-wise than --
    if(lit1.datatype === 'http://www.opengis.net/ont/geosparql#')

Consistency

The other properties are strings, because of the mapping of the RDF Spec to JS types. Using NamedNode whenever there is a NamedNode would be more consistent.

.equals

If we add CURIE support in the future, calling .equals would be much more elegant. Also, this is a low level API. High level libraries may implement a simpler API.

There is also a simpler way of doing a string comparison that you don't show in your examples:

if(lit1.datatype.value === 'http://www.opengis.net/ont/geosparql#')
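The two Term-based comparison styles can be put side by side. This is a minimal sketch, assuming hand-rolled `namedNode` and `literal` helpers (not any particular library's implementation) and using the GeoSPARQL `wktLiteral` IRI purely as an example datatype:

```javascript
// Hypothetical minimal terms, only for illustration; real libraries differ.
function namedNode(value) {
  return {
    termType: 'NamedNode',
    value,
    equals(other) {
      return !!other && other.termType === 'NamedNode' && other.value === value;
    },
  };
}

function literal(value, datatype) {
  return { termType: 'Literal', value, language: '', datatype };
}

// Example datatype IRI (an assumption; the thread only shows the namespace).
const WKT = 'http://www.opengis.net/ont/geosparql#wktLiteral';
const lit1 = literal('POINT(0 0)', namedNode(WKT));

// 1. Interface method: builds a throwaway NamedNode for the comparison.
const viaEquals = lit1.datatype.equals(namedNode(WKT));

// 2. Plain string comparison through .value: no extra object needed.
const viaValue = lit1.datatype.value === WKT;

console.log(viaEquals, viaValue); // true true
```

Both checks give the same answer; the second avoids creating a temporary term.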

Thinking it over, I suppose that if predicate is NamedNode, then datatype should be as well ~ this is consistent. Also, it's a better way for DataFactory's .literal to distinguish between language tags and datatypes without sophisticated regex.
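To illustrate the point about `.literal`: if datatypes are always passed as NamedNode objects, the factory can tell a language tag from a datatype with a simple type check. A rough sketch, assuming a hypothetical stand-alone `literal` function rather than any existing factory:

```javascript
const XSD_STRING = 'http://www.w3.org/2001/XMLSchema#string';
const XSD_INTEGER = 'http://www.w3.org/2001/XMLSchema#integer';
const RDF_LANG_STRING = 'http://www.w3.org/1999/02/22-rdf-syntax-ns#langString';

// Hypothetical minimal NamedNode, only for illustration.
function namedNode(value) {
  return { termType: 'NamedNode', value };
}

// Second argument: a string is read as a language tag, an object as a datatype.
function literal(value, languageOrDatatype) {
  if (typeof languageOrDatatype === 'string') {
    return {
      termType: 'Literal',
      value,
      language: languageOrDatatype,
      datatype: namedNode(RDF_LANG_STRING),
    };
  }
  return {
    termType: 'Literal',
    value,
    language: '',
    // Fall back to xsd:string when no datatype is given.
    datatype: languageOrDatatype || namedNode(XSD_STRING),
  };
}

const tagged = literal('chat', 'fr');
const typed = literal('5', namedNode(XSD_INTEGER));
const plain = literal('hello');

console.log(tagged.language, typed.datatype.value, plain.datatype.value);
```

No regular expression is needed to guess whether the second argument looks like an IRI or a language tag.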

Continuing the discussion of duplicate #93 here.

It seems that having this extra layer of objects introduced by NamedNode is far less "JavaScripty" / DRY. For example, to upgrade my unit tests from string to NamedNode, I would have to change:

literal.equals({
  termType: 'Literal',
  value: '',
  language: '',
  datatype: 'http://www.w3.org/2001/XMLSchema#string',
}).should.be.true;

into

literal.equals({
  termType: 'Literal',
  value: '',
  language: '',
  datatype: {
    termType: 'NamedNode',
    value: 'http://www.w3.org/2001/XMLSchema#string',
  }
}).should.be.true;

which is more verbose than necessary (given that termType: 'NamedNode' is a constant anyway).

Also, should the equals function also accept the former, or just the latter?

Especially curious for opinions from @bergos, @blake-regalia, @elf-pavlik.

  1. I think we should consider this issue and #83 together. IMO requiring to pass a NamedNode to the constructor and then getting IRI (string) from the .datatype property seems inconsistent (I realize that different people could have opposite perceptions of consistency)
  2. We may need to define the direction of the HighLevel API a little more; for example, it could allow either a NamedNode (Term object) or an IRI (string primitive) in all constructors. For accessing the IRI (string), NamedNode#value doesn't seem like a big inconvenience if one knows where to expect a NamedNode. We could also clarify what we can and can NOT expect the HighLevel API to handle.
  3. Low Level API IMO should prioritize performance. Basing decisions like this one on benchmarks could balance discussions about various perceptions of consistency.
  4. Tests could use some utilities to keep them a little more DRY, e.g.:
literal.equals({
  termType: 'Literal',
  value: '',
  language: '',
  datatype: XSD.string
}).should.be.true;
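The `XSD.string` utility used above is not defined in the thread. One possible shape for it, as a sketch with a hypothetical `namedNode` helper and only a few XSD terms filled in:

```javascript
const XSD_NS = 'http://www.w3.org/2001/XMLSchema#';

// Hypothetical minimal NamedNode, only for illustration.
function namedNode(value) {
  return Object.freeze({ termType: 'NamedNode', value });
}

// Shared constants so tests do not repeat the nested { termType, value } object.
const XSD = Object.freeze({
  string: namedNode(XSD_NS + 'string'),
  integer: namedNode(XSD_NS + 'integer'),
  boolean: namedNode(XSD_NS + 'boolean'),
});

const expected = {
  termType: 'Literal',
  value: '',
  language: '',
  datatype: XSD.string,
};

console.log(expected.datatype.value); // http://www.w3.org/2001/XMLSchema#string
```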

Also, should the equals function also accept the former, or just the latter?

I think once we agree if datatype has as its value NamedNode (Term object) or IRI (string) LowLevel API Literal#equals should expect that and not need to handle both cases.
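As a sketch of what "expect that and not handle both cases" could look like if datatype is a NamedNode, here is a hypothetical stand-alone `literalEquals` function (the name and shape are assumptions, not part of the spec):

```javascript
// Compares a literal against another object, expecting datatype to be a NamedNode.
function literalEquals(a, b) {
  return !!b &&
    b.termType === 'Literal' &&
    b.value === a.value &&
    b.language === a.language &&
    !!b.datatype &&
    b.datatype.termType === 'NamedNode' &&
    b.datatype.value === a.datatype.value;
}

const XSD_STRING = 'http://www.w3.org/2001/XMLSchema#string';

const lit = {
  termType: 'Literal',
  value: '',
  language: '',
  datatype: { termType: 'NamedNode', value: XSD_STRING },
};

// Datatype given as a NamedNode-shaped object: accepted.
const withTerm = literalEquals(lit, {
  termType: 'Literal',
  value: '',
  language: '',
  datatype: { termType: 'NamedNode', value: XSD_STRING },
});

// Datatype given as a bare string: rejected, since only one form is supported.
const withString = literalEquals(lit, {
  termType: 'Literal',
  value: '',
  language: '',
  datatype: XSD_STRING,
});

console.log(withTerm, withString); // true false
```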

@elf-pavlik Tests can indeed be made more DRY, but they were just an example. The real question is: do we want to have to make this pattern DRY every time, or should the library just ensure it is?

This is a bit confusing now that we've got this spread across two issues; but I will try to consolidate the arguments here:

Is a datatype really a named node or is it an IRI?

@RubenVerborgh

but is the type really a named node, or just an IRI?

Conceptually, the datatype of a literal really is a named node. Most ontologies will thoroughly describe a datatype with functional metadata. For example, the QUDT units vocabulary:

@prefix unit: <http://qudt.org/vocab/unit#> .
@prefix : <ex://data/> .

# data
:Resource :hasLength "220"^^unit:Meter .

# ontology
unit:Meter qudt:symbol "m" ;
    qudt:conversionMultiplier 1.0 ;
    # ...

In this example, unit:Meter really is a named node in the graph(s). We see the same thing with predicates which get reified with metadata (we also already treat predicates as named nodes). Datatypes are part of this same family.
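With datatype as a Term, looking up what the graph says about a literal's datatype is a direct subject match. A minimal sketch, assuming hand-rolled terms, quads stored in a plain array, and `http://qudt.org/schema/qudt#` as the namespace behind the `qudt:` prefix (the Turtle above does not declare it):

```javascript
// Hypothetical minimal terms and quads, only for illustration.
function namedNode(value) {
  return {
    termType: 'NamedNode',
    value,
    equals(other) {
      return !!other && other.termType === 'NamedNode' && other.value === value;
    },
  };
}

function literal(value, datatype) {
  return { termType: 'Literal', value, language: '', datatype };
}

const UNIT = 'http://qudt.org/vocab/unit#';
const QUDT = 'http://qudt.org/schema/qudt#'; // assumed namespace for qudt:
const XSD_STRING = namedNode('http://www.w3.org/2001/XMLSchema#string');

const quads = [
  // data: :Resource :hasLength "220"^^unit:Meter
  {
    subject: namedNode('ex://data/Resource'),
    predicate: namedNode('ex://data/hasLength'),
    object: literal('220', namedNode(UNIT + 'Meter')),
  },
  // ontology: unit:Meter qudt:symbol "m"
  {
    subject: namedNode(UNIT + 'Meter'),
    predicate: namedNode(QUDT + 'symbol'),
    object: literal('m', XSD_STRING),
  },
];

// Find every quad whose subject is the literal's datatype.
function describeDatatype(allQuads, lit) {
  return allQuads.filter((quad) => quad.subject.equals(lit.datatype));
}

const lengthLiteral = quads[0].object;
const about = describeDatatype(quads, lengthLiteral);
console.log(about.map((q) => `${q.predicate.value} -> ${q.object.value}`));
```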

Constructor vs Accessor

@elf-pavlik

requiring to pass a NamedNode to the constructor and then getting IRI (string) from the .datatype property seems inconsistent

For example, consider the following use case:

Using .datatype as a Term:

// create new literal
const EX_DATATYPE = factory.namedNode('ex://cosmos/planet');
let other = factory.literal('mars', EX_DATATYPE);

// create new literal w/ same datatype as `other` literal
let colonize = factory.literal('venus', other.datatype);  // <-- this makes sense

Using .datatype as a string:

const EX_DATATYPE = 'ex://cosmos/planet';
let other = factory.literal('mars', EX_DATATYPE);

// create new literal w/ same datatype as `other` literal
let colonize = factory.literal('venus', factory.namedNode(other.datatype));
    // we must turn a string into a Term ^^, so that it can be turned back into a string...

Extensibility

@bergos

Other NamedNode methods: Maybe we will add some other methods with the high level API for NamedNode, which we don't have in mind now.

If .datatype were a string, then concrete implementations or a high-level API could not extend datatype with custom properties or prototypical methods.

For example:

let triple = library.someTripleWithLiteral();
triple.subject.doSomethingWithURL();  // works
triple.object.datatype.doSomethingWithURL();  // Uncaught TypeError: __ is not a function
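A sketch of the extensibility argument, assuming a hypothetical `fragment()` helper that a high-level library might add to its NamedNode class. Because the datatype is an instance of that class, it gets the helper as well:

```javascript
class NamedNode {
  constructor(value) {
    this.termType = 'NamedNode';
    this.value = value;
  }

  equals(other) {
    return !!other && other.termType === 'NamedNode' && other.value === this.value;
  }

  // Hypothetical high-level helper, not part of the low-level interface.
  fragment() {
    return new URL(this.value).hash.slice(1);
  }
}

class Literal {
  constructor(value, datatype) {
    this.termType = 'Literal';
    this.value = value;
    this.language = '';
    this.datatype = datatype; // a NamedNode instance, so helpers are available
  }
}

const lit = new Literal('42', new NamedNode('http://www.w3.org/2001/XMLSchema#integer'));
console.log(lit.datatype.fragment()); // integer

// With a plain string datatype, the same helper does not exist.
const asString = 'http://www.w3.org/2001/XMLSchema#integer';
console.log(typeof asString.fragment); // undefined
```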

Performance

@elf-pavlik

Low Level API IMO should prioritize performance. Basing decisions like this one on benchmarks could balance discussions about various perceptions of consistency.

@bergos

Performance: It [Term] can be implemented without performance drawbacks and the implementation will be still super simple. If we use string we must rely on the JS engine (don't forget embedded devices!).

I created two versions of graphy, one using Term and one using string for .datatype. I benchmarked parsing a 48.2 MB Turtle file containing 282k triples and 129k literals, 34.5k of them with the xsd:integer datatype. I also benchmarked the following operations:

  • datatype filter: searching every literal for a specific datatype (internally w/ optimizations).
  • datatype equals: search every literal for a specific datatype (using interface methods)
  • literal equals: search every literal for a specific literal

Benchmark Results:

| case | Term | string | comment |
|---|---|---|---|
| parse | 520ms | 510ms* | this assumes the user builds an index of datatypes for the Term datatype filter case while consuming triples (hence the 10ms difference) |
| datatype filter | 3.4ms* | 9.16ms | Term is ~2.7x faster when using === to compare .datatype Term objects by their address, as opposed to string comparison which checks each character until finding a difference |
| datatype equals | 7.67ms | 6.54ms* | string is 17% faster here |
| literal equals | 9.4ms | 9.14ms* | string is only 2% faster here |

The Term version also actually has a slightly smaller memory footprint (121 MB as opposed to 124 MB for string). I assume this is due to the fact that the GC frees the strings after consolidating datatype objects w/ the hash I implemented for the test. This also implies that V8's string interning is not that aggressive.
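The consolidation of datatype objects mentioned here can be sketched as an interning map, so that literals with the same datatype share one Term object and can be compared with `===`. This is an illustration under that assumption, not graphy's actual code; `internDatatype` and the sample literals are hypothetical:

```javascript
const XSD = 'http://www.w3.org/2001/XMLSchema#';

// One shared NamedNode object per datatype IRI.
const datatypeIndex = new Map();

function internDatatype(iri) {
  let node = datatypeIndex.get(iri);
  if (node === undefined) {
    node = Object.freeze({ termType: 'NamedNode', value: iri });
    datatypeIndex.set(iri, node);
  }
  return node;
}

function literal(value, datatypeIri) {
  return {
    termType: 'Literal',
    value,
    language: '',
    datatype: internDatatype(datatypeIri),
  };
}

const literals = [
  literal('1', XSD + 'integer'),
  literal('a', XSD + 'string'),
  literal('2', XSD + 'integer'),
];

// Identity comparison: no character-by-character string check is needed.
const XSD_INTEGER = internDatatype(XSD + 'integer');
const integers = literals.filter((lit) => lit.datatype === XSD_INTEGER);

console.log(integers.map((lit) => lit.value)); // [ '1', '2' ]
```

The identity check only works if every literal goes through the same index, which is the "with optimizations" caveat in the results.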

Benchmark Conclusions:

With optimizations, Term outperforms string in filtering for a specific datatype. It also consumes less memory (~3 MB less for this input file). string's only real performance strength is seen in the datatype equals case, which is not a huge difference anyway (~17%).

Great work, @blake-regalia. I follow everything except the "Constructor vs Accessor" argument (i.e., we would obviously also redefine the constructor if we made .datatype a string).

I still think it's more cumbersome for developers to have the datatype as a NamedNode. It makes the toJSON unnecessarily complex. But that's my only objection in presence of lots of other arguments. I'm convinced now to go for NamedNode.

resolved in #92