`.datatype` should be a string
Closed this issue · 8 comments
`Literal#datatype` should be a string
In RDF, datatypes are URIs. They are never blank nodes. Using a `Term` is a redundant and inconvenient way to make the value of the IRI a little farther from reach. As opposed to a `string`, the `Term` also has a slight negative impact on memory and performance.
Consistency
You may think it's consistent to wrap `.datatype` as a `Term`, but at this tier in the class relation tree, it's actually inconsistent with the `.value` property of a NamedNode and all other Term properties, which are all `string`s.
.equals
Aside from the ontology portion of a graph, datatype IRIs just about never coincide with the IRIs of subject or object nodes (i.e. NamedNodes).
- As a developer, it makes sense that, since the overwhelming majority of datatype comparisons are to other datatypes, the naive comparison is the most natural one:
  if(lit1.datatype === lit2.datatype)
  rather than the obscure cases where having datatype already be a `Term` comes in handy:
  let subject = // where datatype also happens to appear as a subject
  if(lit1.datatype.equals(subject))
- Rather, for those rare incidents, you can do:
  if(DataFactory.namedNode(lit1.datatype).equals(subject))
  // -- or, better yet --
  if(subject.value === lit1.datatype && isNamedNode(subject))
- Again, these are obscure cases - usually to fetch some information about the datatype itself from the ontology.
- The much more common application of datatypes is for testing whether or not a particular datatype is the right one.
  // right now, the implication is to use the interface method:
  if(lit1.datatype.equals(DataFactory.namedNode('http://www.opengis.net/ont/geosparql#')))
  // -- but that is messy, less elegant, and much slower performance-wise than --
  if(lit1.datatype === 'http://www.opengis.net/ont/geosparql#')
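To make the two designs easier to compare, here is a minimal sketch that puts them side by side. The `NamedNode` class is a hypothetical stand-in written only for this example (it is not the spec's interface definition), and the literal value `POINT(0 0)` is made up; the IRI is the one used in the examples above.

```javascript
// Hypothetical minimal NamedNode, written only for this illustration.
class NamedNode {
  constructor(iri) {
    this.termType = 'NamedNode';
    this.value = iri;
  }

  // Two named nodes are equal when both are NamedNodes with the same IRI.
  equals(other) {
    return !!other && other.termType === 'NamedNode' && other.value === this.value;
  }
}

const GEO = 'http://www.opengis.net/ont/geosparql#';

// Design A: .datatype is a plain string.
const litA = { termType: 'Literal', value: 'POINT(0 0)', datatype: GEO };
const matchesA = litA.datatype === GEO;

// Design B: .datatype is a NamedNode.
const litB = { termType: 'Literal', value: 'POINT(0 0)', datatype: new NamedNode(GEO) };
const matchesB1 = litB.datatype.equals(new NamedNode(GEO)); // interface method
const matchesB2 = litB.datatype.value === GEO;              // string compare via .value

console.log(matchesA, matchesB1, matchesB2);
```

All three checks answer the same question; they differ only in how many objects are allocated and how far the IRI string is from reach.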
Consistency
The other properties are `string`s, because of the mapping of the RDF Spec to JS types. Using `NamedNode` whenever there is a `NamedNode` would be more consistent.
.equals
If we add CURIE support in the future, calling `.equals` would be much more elegant. Also, this is a low level API. High level libraries may implement a simpler API.
There is also a simpler way of doing a string comparison that you don't show in your examples:
if(lit1.datatype.value === 'http://www.opengis.net/ont/geosparql#')
Thinking it over, I suppose that if predicate is NamedNode, then datatype should be as well ~ this is consistent. Also, it's a better way for DataFactory's `.literal` to distinguish between language tags and datatypes without sophisticated regex.
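The point about `.literal` can be sketched as follows: if datatypes are always `NamedNode` objects, the factory can tell a language tag from a datatype by a type check alone. The functions below are hypothetical and written for this example, not taken from any implementation; only the two IRIs are standard.

```javascript
// Standard IRIs for plain strings and language-tagged strings.
const XSD_STRING = 'http://www.w3.org/2001/XMLSchema#string';
const RDF_LANG_STRING = 'http://www.w3.org/1999/02/22-rdf-syntax-ns#langString';

// Hypothetical factory functions, written only for this illustration.
function namedNode(iri) {
  return { termType: 'NamedNode', value: iri };
}

function literal(value, languageOrDatatype) {
  // A string second argument can only be a language tag.
  if (typeof languageOrDatatype === 'string') {
    return {
      termType: 'Literal',
      value,
      language: languageOrDatatype,
      datatype: namedNode(RDF_LANG_STRING),
    };
  }
  // Otherwise it is a NamedNode datatype; when omitted, fall back to xsd:string.
  return {
    termType: 'Literal',
    value,
    language: '',
    datatype: languageOrDatatype || namedNode(XSD_STRING),
  };
}

const greeting = literal('hello', 'en');
const count = literal('42', namedNode('http://www.w3.org/2001/XMLSchema#integer'));
const plain = literal('abc');
```

If both language tags and datatypes were strings, the factory would instead have to inspect the string's contents to guess which one it was given.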
Continuing the discussion of duplicate #93 here.
It seems that having this extra layer of objects introduced by `NamedNode` is far less "JavaScripty" / DRY. For example, to upgrade my unit tests from `string` to `NamedNode`, I would have to change:
literal.equals({
  termType: 'Literal',
  value: '',
  language: '',
  datatype: 'http://www.w3.org/2001/XMLSchema#string',
}).should.be.true;
into
literal.equals({
  termType: 'Literal',
  value: '',
  language: '',
  datatype: {
    termType: 'NamedNode',
    value: 'http://www.w3.org/2001/XMLSchema#string',
  }
}).should.be.true;
which is more verbose than necessary (given that `termType: 'NamedNode'` is a constant anyway).
Also, should the `equals` function also accept the former, or just the latter?
Especially curious for opinions from @bergos, @blake-regalia, @elf-pavlik.
- I think we should consider this issue and #83 together. IMO requiring to pass a NamedNode to the constructor and then getting IRI (string) from the .datatype property seems inconsistent (I realize that different people could have opposite perceptions of consistency)
- We may need to define the direction of the HighLevel API a little more; for example, it could allow either a NamedNode (Term object) or an IRI (string primitive) in all constructors. For accessing the IRI (string), NamedNode#value doesn't seem like a big inconvenience if one knows where to expect a NamedNode. We could also clarify what we can and what we can NOT expect the HighLevel API to handle.
- Low Level API IMO should prioritize performance. Basing decisions like this one on benchmarks could balance discussions about various perceptions of consistency.
- Tests could have some utilities to keep them a little more DRY, e.g.:
literal.equals({
  termType: 'Literal',
  value: '',
  language: '',
  datatype: XSD.string
}).should.be.true;
Also, should the equals function also accept the former, or just the latter?
I think once we agree on whether datatype holds a NamedNode (Term object) or an IRI (string), the LowLevel API Literal#equals should expect that and not need to handle both cases.
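As a rough sketch of both suggestions, the snippet below defines a small `XSD` constants helper for tests and an equality check that accepts only the NamedNode form of `datatype`. The names `XSD`, `xsd` and `literalEquals` are hypothetical and exist only for this example; the XSD namespace IRI is the standard one.

```javascript
// Hypothetical test helper: predefined NamedNode constants for common XSD datatypes.
const XSD_NS = 'http://www.w3.org/2001/XMLSchema#';
const xsd = (local) => ({ termType: 'NamedNode', value: XSD_NS + local });
const XSD = { string: xsd('string'), integer: xsd('integer') };

// Hypothetical low-level equality check that expects datatype to be a NamedNode.
function literalEquals(a, b) {
  return !!b
    && b.termType === 'Literal'
    && b.value === a.value
    && b.language === a.language
    && !!b.datatype                          // rejects a missing datatype
    && b.datatype.termType === 'NamedNode'   // rejects a bare IRI string
    && b.datatype.value === a.datatype.value;
}

const lit = { termType: 'Literal', value: '', language: '', datatype: XSD.string };

const sameWithTerm = literalEquals(lit, {
  termType: 'Literal', value: '', language: '', datatype: XSD.string,
});
const sameWithString = literalEquals(lit, {
  termType: 'Literal', value: '', language: '', datatype: XSD_NS + 'string',
});

console.log(sameWithTerm, sameWithString);
```

Under these assumptions the first comparison succeeds and the second fails, because a string has no `termType`.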
@elf-pavlik Tests can indeed be made more DRY, but they were just an example. The real question is: do we want to have to make this pattern DRY every time, or should the library just ensure it is?
This is a bit confusing now that we've got this spread across two issues; but I will try to consolidate the arguments here:
Is a datatype really a named node or is it an IRI?
but is the type really a named node, or just an IRI?
Conceptually, the datatype of a literal really is a named node. Most ontologies will thoroughly describe a datatype with functional metadata. For example, the QUDT units vocabulary:
@prefix unit: <http://qudt.org/vocab/unit#> .
@prefix : <ex://data/> .
# data
:Resource :hasLength "220"^^unit:Meter .
# ontology
unit:Meter qudt:symbol "m" ;
qudt:conversionMultiplier 1.0 ;
# ...
In this example, `unit:Meter` really is a named node in the graph(s). We see the same thing with predicates which get reified with metadata (we also already treat predicates as named nodes). Datatypes are part of this same family.
Constructor vs Accessor
requiring to pass a NamedNode to the constructor and then getting IRI (string) from the .datatype property seems inconsistent
For example, consider the following use case:
Using `.datatype` as a `Term`:
// create new literal
const EX_DATATYPE = factory.namedNode('ex://cosmos/planet');
let other = factory.literal('mars', EX_DATATYPE);
// create new literal w/ same datatype as `other` literal
let colonize = factory.literal('venus', other.datatype); // <-- this makes sense
Using `.datatype` as a `string`:
const EX_DATATYPE = 'ex://cosmos/planet';
let other = factory.literal('mars', EX_DATATYPE);
// create new literal w/ same datatype as `other` literal
let colonize = factory.literal('venus', factory.namedNode(other.datatype));
// we must turn a string into a Term ^^, so that it can be turned back into a string...
Extensibility
Other `NamedNode` methods: Maybe we will add some other methods with the high level API for `NamedNode`, which we don't have in mind now.
If `.datatype` were a `string`, then concrete implementations or a high-level API could not extend datatype with custom properties or prototypical methods.
For example:
let triple = library.someTripleWithLiteral();
triple.subject.doSomethingWithURL(); // works
triple.object.datatype.doSomethingWithURL(); // Uncaught TypeError: __ is not a function
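To illustrate, here is a small sketch of what such an extension might look like. `RichNamedNode` and its `localName` method are hypothetical names invented for this example; they stand in for whatever a high-level library might add.

```javascript
// Hypothetical base class, written only for this illustration.
class NamedNode {
  constructor(iri) {
    this.termType = 'NamedNode';
    this.value = iri;
  }
}

// Hypothetical high-level extension that adds a convenience method.
class RichNamedNode extends NamedNode {
  // Returns the part of the IRI after the last '#' or '/'.
  localName() {
    const i = Math.max(this.value.lastIndexOf('#'), this.value.lastIndexOf('/'));
    return this.value.slice(i + 1);
  }
}

const XSD_INTEGER = 'http://www.w3.org/2001/XMLSchema#integer';

// With a Term datatype, the extension is available on the datatype too.
const asTerm = { termType: 'Literal', value: '5', datatype: new RichNamedNode(XSD_INTEGER) };
const name = asTerm.datatype.localName();

// With a string datatype, there is no such method to call.
const asString = { termType: 'Literal', value: '5', datatype: XSD_INTEGER };
const hasMethod = typeof asString.datatype.localName === 'function';

console.log(name, hasMethod);
```

The string variant could only gain such a method by patching `String.prototype`, which would affect every string in the program.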
Performance
Low Level API IMO should prioritize performance. Basing decisions like this one on benchmarks could balance discussions about various perceptions of consistency.
Performance: It [`Term`] can be implemented without performance drawbacks and the implementation will be still super simple. If we use string we must rely on the JS engine (don't forget embedded devices!).
I created two versions of graphy, one using `Term` and one using `string` for `.datatype`. I benchmarked parsing a 48.2 MB ttl file containing 282k triples, 129k literals w/ 34.5k xsd:integer datatypes. I also benchmarked the following operations:
- datatype filter: searching every literal for a specific datatype (internally w/ optimizations).
- datatype equals: search every literal for a specific datatype (using interface methods)
- literal equals: search every literal for a specific literal
Benchmark Results:
case | Term | string | comment |
---|---|---|---|
parse | 520ms | 510ms* | this assumes the user builds an index of datatypes for the Term datatype filter case while consuming triples (hence the 10ms difference) |
datatype filter | 3.4ms* | 9.16ms | Term is ~2.7x faster when using === to compare .datatype Term objects by their address, as opposed to string comparison which checks each character until finding a difference |
datatype equals | 7.67ms | 6.54ms* | string is 17% faster here |
literal equals | 9.4ms | 9.14ms* | string is only 2% faster here |
The `Term` also actually has a slightly smaller memory footprint (121 MB as opposed to `string`'s 124 MB consumption). I assume this is due to the fact that the GC frees the strings after consolidating datatype objects w/ the hash I implemented for the test. This also implies that V8's string interning is not that aggressive.
Benchmark Conclusions:
With optimizations, `Term` outperforms `string` in filtering for a specific datatype. It also consumes less memory (~3 MB for this input file). `string`'s only real performance strength is seen in the datatype equals case, which is not a huge difference anyway (~17%).
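The "datatype filter" optimization relies on every literal with the same datatype sharing one `Term` object, so that `===` compares references instead of characters. The sketch below shows one way such consolidation might be done with a `Map`; it is an assumption about the general technique, not graphy's actual code, and `internNamedNode` is a hypothetical name.

```javascript
// Hypothetical interning cache: at most one NamedNode object per distinct IRI.
const cache = new Map();

function internNamedNode(iri) {
  let node = cache.get(iri);
  if (!node) {
    node = Object.freeze({ termType: 'NamedNode', value: iri });
    cache.set(iri, node);
  }
  return node;
}

const XSD = 'http://www.w3.org/2001/XMLSchema#';

// Literals created while parsing all go through the same cache.
const literals = [
  { termType: 'Literal', value: '1', datatype: internNamedNode(XSD + 'integer') },
  { termType: 'Literal', value: 'a', datatype: internNamedNode(XSD + 'string') },
  { termType: 'Literal', value: '2', datatype: internNamedNode(XSD + 'integer') },
];

// Filtering compares object identity, not the characters of the IRI.
const XSD_INTEGER = internNamedNode(XSD + 'integer');
const integers = literals.filter((lit) => lit.datatype === XSD_INTEGER);

console.log(integers.length); // 2
```

The trade-off is the extra map lookup per literal at parse time, which would be consistent with the small parse-time difference reported in the table.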
Great work, @blake-regalia. I follow everything except the "Constructor vs Accessor" argument (i.e., we would obviously also redefine the constructor if we made `.datatype` a `string`).
I still think it's more cumbersome for developers to have the datatype as a NamedNode. It makes the `toJSON` unnecessarily complex. But that's my only objection in presence of lots of other arguments. I'm convinced now to go for `NamedNode`.
resolved in #92