RDFLib/rdflib-hdt

Querying multiple HDT graphs

chiarcos opened this issue · 1 comments

This is more like a question: Is it possible to query with SPARQL over more than one HDT file or over an HDT file and a regular RDF graph? The points below apply only under the assumption that it is not (I didn't figure out how).

Is your feature request related to a problem? Please describe.
We use the WordNet HDT and several other HDT graphs containing bilingual dictionaries. We want to query WordNet together with the dictionaries to retrieve (possible) synsets from WordNet (HDT1) for, say, Spanish words via a Spanish-English dictionary (HDT2).

Describe the solution you'd like
Treat each HDT file as a graph of its own and access it via the SPARQL GRAPH keyword (or FROM, USING, etc.).
The downside is that HDT graphs are read-only (I guess), so if HDT graphs are freely mixed with writable graphs, users may be tempted to write into them and be frustrated by the results.
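
To make the proposal and the read-only concern concrete, here is a minimal sketch in plain Python. It is not rdflib or rdflib-hdt API: `ReadOnlyUnion`, its `triples` method, the `ex:` predicates, and the sample data are all hypothetical, and sets of tuples stand in for HDT files. The idea is a view that answers pattern lookups across several sources and rejects writes explicitly instead of silently ignoring them.

```python
from itertools import chain
from typing import Iterable, Iterator, Optional, Tuple

Triple = Tuple[str, str, str]
Pattern = Tuple[Optional[str], Optional[str], Optional[str]]


class ReadOnlyUnion:
    """Hypothetical read-only view over several re-iterable triple sources."""

    def __init__(self, *sources: Iterable[Triple]) -> None:
        self._sources = sources

    def triples(self, pattern: Pattern = (None, None, None)) -> Iterator[Triple]:
        """Yield each matching triple once; None acts as a wildcard."""
        seen = set()
        for triple in chain.from_iterable(self._sources):
            if triple in seen:
                continue
            if all(p is None or p == t for p, t in zip(pattern, triple)):
                seen.add(triple)
                yield triple

    def add(self, triple: Triple) -> None:
        # Fail loudly so that users notice the view cannot be written to.
        raise TypeError("ReadOnlyUnion is read-only")


# Toy stand-ins for a WordNet-like source and a Spanish-English dictionary.
wordnet = {("en:dog", "ex:inSynset", "wn:synset-dog-n-1")}
dictionary = {("es:perro", "ex:translation", "en:dog")}

view = ReadOnlyUnion(wordnet, dictionary)

# Spanish word -> English word (dictionary) -> synset (WordNet).
synsets = [
    synset
    for _, _, english in view.triples(("es:perro", "ex:translation", None))
    for _, _, synset in view.triples((english, "ex:inSynset", None))
]
```

Raising on `add` is one possible design choice; it trades convenience for making the read-only nature of the underlying data obvious at the point of the mistake.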

Describe alternatives you've considered
Overload the SERVICE keyword, i.e.

  • If an HDT file is read, allow users to assign it an identifier (let's call that "service URI")
  • When evaluating SPARQL queries, if a SERVICE is invoked, check whether it is a pre-registered service URI. If so, return the results of SELECT * {...} from the HDT file; otherwise, evaluate using the standard implementation for SERVICE
  • This could probably be done by means of a SPARQL extension

In terms of ease of implementation and user experience, this may be the preferred solution, but I feel this is a bit of a hack, and it will produce SPARQL queries that probably cannot be ported to other HDT implementations (unless they adopt the same strategy).
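
The registration-and-dispatch idea from the list above could look roughly like the following. This is a sketch in plain Python under stated assumptions, not an rdflib extension: `register_service`, `evaluate_service`, the `urn:example:` URI, and the stubbed remote evaluator are hypothetical names, and matching the pattern inside the SERVICE block is left out for brevity.

```python
from typing import Callable, Dict, Iterable, List, Tuple

Triple = Tuple[str, str, str]
RemoteEvaluator = Callable[[str], Iterable[Triple]]

# Hypothetical registry mapping a "service URI" to a locally loaded source.
_local_services: Dict[str, Iterable[Triple]] = {}


def register_service(uri: str, source: Iterable[Triple]) -> None:
    """Associate a service URI with a local source (e.g. an opened HDT file)."""
    _local_services[uri] = source


def evaluate_service(uri: str, remote: RemoteEvaluator) -> List[Triple]:
    """Answer from the local source if the URI is registered, else go remote."""
    if uri in _local_services:
        # Local shortcut: no HTTP round trip, no endpoint to set up.
        return list(_local_services[uri])
    # Unregistered URIs fall through to the standard SERVICE behaviour.
    return list(remote(uri))


# Usage with a stub in place of a real SPARQL endpoint call.
remote_calls: List[str] = []


def fake_remote(uri: str) -> Iterable[Triple]:
    remote_calls.append(uri)
    return []


register_service("urn:example:wordnet", [("en:dog", "ex:inSynset", "wn:synset-dog-n-1")])

local_result = evaluate_service("urn:example:wordnet", fake_remote)
remote_result = evaluate_service("http://example.org/sparql", fake_remote)
```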

Alternatives that don't apply
In theory, we could actually use the standard implementation of SERVICE, but that would require setting up one endpoint per HDT file, which would slow things down and create considerable overhead in both coding and communication. This may not be much of an issue if we're consulting just two HDT files, but we have plans to do that on a massively multilingual scale, so we might end up with dozens or hundreds of HDT files per query.

Additional context
That seems to be a feature of the Jena integration

Apologies, this turned out to be a non-issue, because this functionality is indirectly provided via ConjunctiveGraph (I guess). The following works (replace the paths with your own HDT files to replicate):

import rdflib
import rdflib_hdt

# Load each HDT file as an rdflib graph and count its distinct triples.
graph1 = rdflib.Graph(store=rdflib_hdt.HDTStore("../../models/kgs/verbnet/verbnet.hdt"))
print(graph1, len(graph1.query("SELECT DISTINCT * { ?x ?y ?z }")))

graph2 = rdflib.Graph(store=rdflib_hdt.HDTStore("../../samples/hdt/all.hdt"))
print(graph2, len(graph2.query("SELECT DISTINCT * { ?x ?y ?z }")))

# The + operator returns a new graph with the triples of both operands,
# so a single SPARQL query can span both files.
graph = graph1 + graph2
print(graph, len(graph.query("SELECT DISTINCT * { ?x ?y ?z }")))