A blend of Comunica and TREE, the finest ingredients for querying a decentralized Web
Learn more about Comunica on the comunica.dev.
Read the TREE specification at treecg.github.io/specification.
Overview
This is a fork of the Comunica framework, so that most of Comunica's documentation can be applied to this project as well. The main difference is the scope of the two projects; whereas Comunica focuses on querying Knowledge Graphs, this fork focuses on utilizing the TREE hypermedia specification for usecases such as autocompletion and tracking evolving datasets.
Components
ActorInitTypeahead
This actor has its own documentation.
ActorRdfScore
Some usecases require the ordering of all found results, often because only a fixed small amount of results will be shown to the end-user. These actors score a single RDF quad to { [-Inf, +Inf] ∪ {null} }
, where +Inf signifies the best possible score, and -Inf signifies the worst possible score. null
indicates the actor could not score this quad. These scores have a partial ordering, where the null
score is incomparable to any other score.
A single score will often be insufficient as additional scores are needed to break ties. For example, it may be desirable to first order by the length of the common substring, followed by the length of the found string. Sequences of scores can be ordered as well, where (a,b) ≤ (a′,b′) if and only if a < a′ or (¬(a' < a') and b ≤ b′)
.
The Action object given to these actors contains the RDF quad that needs to be scored, and an optional literalValue
property that indicates that the quad's object is to be interpreted as this literal -- presumably because it has been normalized. The action also specifies which values are being searched for, either by specifying the expected values for a specific predicate, or as the expected values for a datatype as a whole. These expected values share the same Typescript definition: type IExpectedValues = Record<string, any[]>;
, where the Record's keys are URIs and the values may be any Javascript value. It is the actor's responsibility to decide which of the specified values to compare to the quad, or to return null
if the actor is unable to score this quad at all.
The following actors are currently implemented:
- ActorRdfScoreBigram: Counts how many bigrams of the expected values can be found in the scored quad's literal. Returns
null
if the quad's object is not axsd:string
orxsd:langString
. - ActorRdfScoreCommonPrefix: Returns the length of the longest common prefix between the expected values and the quad's literal value, or
null
if the quad's object is not axsd:string
orxsd:langString
. - ActorRdfScoreExactPrefix: Like the above, but returns
-Inf
if the none of the expected values are a prefix of the quad's value(s). - ActorRdfScoreStringLength: As the name implies, simply returns the length of the string.
ActorTreeScore
Like the ActorRdfScore, but for TREE relations and their specified values. There are two differences though:
-
The Action object only contains a
values
and anexpectedValues
value, which both share this type definition:type TreeValues = Record<string, RDF.Literal[]>
. The key in this record type is the URI of a TREE relation class (e.g.,tree:SubstringRelation
), and the values are the objects of thetree:value
properties. -
These scores are not used to order already discovered quads. Instead, they are used to prioritize TREE relations to discover those quads as quickly as possible. The
expectedValues
property indicates which TREE nodes we want to reach, without knowing whether or not this is possible.
The following actors are currently implemented:
- ActorTreeScoreExactPrefix: This actor will return the length of the common prefix between the expeced value and the found value, or
-Inf
if the found value is not a prefix of the expected value. This is similar to the ActorRdfScoreExactPrefix, except that the roles of the expected and found values are reversed.
ActorRdfMetadataExtractTree
Wraps the existing TREE metadata extraction library in a Comunica actor, and is used alongside the Comunica's existing metadata extraction actors.
ActorRdfResolveHypermediaLinksTreeAll
Takes the extracted hypermedia data, preferably from ActorRdfMetadataExtractTree, and extracts a list of all discovered hypermedia links.
ActorLiteralNormalize
Normalization is a common preprocessing in information retrieval systems to improve the system's recall by making it less sensitive to the various ways of encoding the same information in a literal value. in TREEmunica we provide a base class for normalizing literals, and subclasses of this class communicate which datatypes they can normalize.
We currently provide a single implementation that normalizes all xsd:string
and xsd:langString
literals to NFKD, and then discards all codepoints that aren't in the Letter (L), number (N), or Separator (Z) categories.
Development Setup
This repository is a fork of the Comunica repository, and is likewise managed as a monorepo using Lerna. The project can be setup by cloning and installing it as follows:
$ git clone https://github.com/TREEcg/treemunica.git
$ cd comunica
$ yarn install
Note: npm install
is not supported at the moment, as this project makes use of Yarn's workspaces functionality
This will install the dependencies of all modules, and bootstrap the Lerna monorepo. After that, all Comunica packages are available in the packages/
folder and can be used in a development environment.
Furthermore, this will add pre-commit hooks to build, lint and test. These hooks can temporarily be disabled at your own risk by adding the -n
flag to the commit command.
Cite
If you are using or extending Comunica as part of a scientific publication, we would appreciate a citation of our article.
@inproceedings{taelman_iswc_resources_comunica_2018,
author = {Taelman, Ruben and Van Herwegen, Joachim and Vander Sande, Miel and Verborgh, Ruben},
title = {Comunica: a Modular SPARQL Query Engine for the Web},
booktitle = {Proceedings of the 17th International Semantic Web Conference},
year = {2018},
month = oct,
url = {https://comunica.github.io/Article-ISWC2018-Resource/}
}
License
This code is copyrighted by Ghent University – imec and released under the MIT license.