Ontario

Ontario: A Federated SPARQL Query Processor over Semantic Data Lakes

Using Ontario

Check the demo folder for dockerized examples.

Mapping file

chebi-tsv-mapping.ttl

@prefix rr: <http://www.w3.org/ns/r2rml#> .
@prefix rml: <http://semweb.mmlab.be/ns/rml#> .
@prefix ql: <http://semweb.mmlab.be/ns/ql#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix dc: <http://purl.org/dc/elements/1.1/> .
@prefix chebi: <http://tib.eu/ontology/chebi/> .
@prefix : <http://tib.de/ontario/mapping#> .

:chebi_compound
    rml:logicalSource [
        rml:source "compounds.tsv";
        rml:referenceFormulation ql:TSV;
        rml:iterator "*"
    ];
    rr:subjectMap [
        rr:template "http://www.ebi.ac.uk/chebi/{ID}";
        rr:class chebi:Compound
    ];
    rr:predicateObjectMap [
        rr:predicate chebi:accession;
        rr:objectMap [
            rml:reference "CHEBI_ACCESSION"
        ]
    ];
    rr:predicateObjectMap [
        rr:predicate rdfs:label;
        rr:objectMap [
            rml:reference "NAME"
        ]
    ].
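
To make the mapping concrete, the following minimal Python sketch applies the same rules by hand to one TSV row. It only illustrates what the mapping expresses; it is not how Ontario evaluates mappings. The sample row, the function names, and the use of prefixed names instead of full IRIs are assumptions made for illustration, and IRI-safe encoding of template values is omitted.

```python
import csv
import io

# Hypothetical stand-in for compounds.tsv; the row values are placeholders.
SAMPLE_TSV = "ID\tCHEBI_ACCESSION\tNAME\n42\tCHEBI:42\texample compound\n"

# Taken from rr:subjectMap in the mapping above.
SUBJECT_TEMPLATE = "http://www.ebi.ac.uk/chebi/{ID}"
SUBJECT_CLASS = "chebi:Compound"

# One entry per rr:predicateObjectMap: predicate -> referenced TSV column.
PREDICATE_COLUMNS = {
    "chebi:accession": "CHEBI_ACCESSION",
    "rdfs:label": "NAME",
}


def row_to_triples(row):
    """Return the (subject, predicate, object) triples for one TSV row."""
    subject = SUBJECT_TEMPLATE.format(ID=row["ID"])
    triples = [(subject, "rdf:type", SUBJECT_CLASS)]
    for predicate, column in PREDICATE_COLUMNS.items():
        value = row.get(column)
        if value:  # skip empty cells instead of emitting empty literals
            triples.append((subject, predicate, value))
    return triples


def tsv_to_triples(text):
    """Apply row_to_triples to every row of a tab-separated document."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return [triple for row in reader for triple in row_to_triples(row)]


if __name__ == "__main__":
    for triple in tsv_to_triples(SAMPLE_TSV):
        print(triple)
```

Each row yields one rdf:type triple from the subject map plus one triple per predicate-object map whose column is non-empty.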

Configurations

To generate the RDF Molecule Templates (RDF-MTs), prepare a list of data sources, each with its mapping files (if any), as follows:

datasources.json

[
  {
    "name": "ChEBI-TSV",
    "ID": "http://iasis.eu/datasource/chebi-tsv",
    "url": "/home/user/data/ChEBI-TSV",
    "params": {
      "spark.driver.cores": "4",
      "spark.executor.cores": "4",
      "spark.cores.max": "6",
      "spark.default.parallelism": "4",
      "spark.executor.memory": "6g",
      "spark.driver.memory": "12g",
      "spark.driver.maxResultSize": "8g",
      "spark.python.worker.memory": "10g",
      "spark.local.dir": "/tmp"
    },
    "type": "LOCAL_TSV",
    "mappings": ["/home/user/git/Ontario/mappings/ChEBI/chebi-tsv-mapping.ttl"]
  }
]

The "type" value of a data source can be one of the following:

    SPARQL_Endpoint    
    MySQL
    LOCAL_CSV
    LOCAL_TSV
    LOCAL_JSON
    LOCAL_XML
    HADOOP_CSV
    HADOOP_TSV
    HADOOP_JSON
    HADOOP_XML
    MongoDB
    Neo4j

Then run the following:

    python3 scripts/create_rdfmts.py -s datasources.json -o config.json
    

The RDF-MTs are then generated either by querying the data sources directly or from the RML mappings. The resulting config.json file contains the following information:

{
  "templates": [
    {
      "rootType": "http://tib.eu/ontology/chebi/Compound",
      "datasources": [
        {
          "datasource": "http://iasis.eu/datasource/chebi-tsv",
          "predicates": [
            "http://www.w3.org/1999/02/22-rdf-syntax-ns#type",
            "http://www.w3.org/2000/01/rdf-schema#label",
            "http://tib.eu/ontology/chebi/accession"
          ]
        }
      ],
      "predicates": [
        {
          "predicate": "http://www.w3.org/1999/02/22-rdf-syntax-ns#type",
          "range": []
        },
        {
          "predicate": "http://tib.eu/ontology/chebi/accession",
          "range": []
        },
        {
          "predicate": "http://www.w3.org/2000/01/rdf-schema#label",
          "range": []
        }
      ]
    }
  ],
  "datasources": [
    {
      "name": "ChEBI-TSV",
      "ID": "http://iasis.eu/datasource/chebi-tsv",
      "url": "/home/user/data/ChEBI-TSV",
      "params": {
        "spark.driver.cores": "4",
        "spark.executor.cores": "4",
        "spark.cores.max": "6",
        "spark.default.parallelism": "4",
        "spark.executor.memory": "6g",
        "spark.driver.memory": "12g",
        "spark.driver.maxResultSize": "8g",
        "spark.python.worker.memory": "10g",
        "spark.local.dir": "/tmp"
      },
      "type": "LOCAL_TSV",
      "mappings": ["/home/user/git/Ontario/mappings/ChEBI/chebi-tsv-mapping.ttl"]
    }
  ]
}
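
One way to read this file: each template records, for a class (rootType), which data sources offer which predicates. The Python sketch below uses that information to find the data sources able to answer a given set of predicates. It is a simplified illustration of source selection under that reading, not Ontario's planner; the function name and the trimmed CONFIG literal are assumptions.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"
ACCESSION = "http://tib.eu/ontology/chebi/accession"

# Trimmed copy of the "templates" part of the config.json shown above.
CONFIG = {
    "templates": [
        {
            "rootType": "http://tib.eu/ontology/chebi/Compound",
            "datasources": [
                {
                    "datasource": "http://iasis.eu/datasource/chebi-tsv",
                    "predicates": [RDF_TYPE, RDFS_LABEL, ACCESSION],
                }
            ],
        }
    ]
}


def candidate_sources(config, predicates):
    """Map each rootType to the data sources that offer all given predicates."""
    wanted = set(predicates)
    matches = {}
    for template in config["templates"]:
        for source in template["datasources"]:
            if wanted <= set(source["predicates"]):
                matches.setdefault(template["rootType"], []).append(
                    source["datasource"]
                )
    return matches


if __name__ == "__main__":
    print(candidate_sources(CONFIG, [RDFS_LABEL, ACCESSION]))
```

A template with no data source covering every requested predicate is simply absent from the returned dictionary.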

Running Ontario

Ontario is written in Python 3 and depends on several Python packages to communicate with different databases and services. To install the required packages, run:

    pip3 install -r requirements.txt

Install Ontario:

    python3 setup.py install

To run queries:

    ./runExperiment.py -q path/to/sparqlquery.txt -c path/to/config.json -p False

To see only the query plans, set -p True.

To run multiple queries in a folder:

    ./runOntarioExp.sh /path/to/queriesfolder/ path/to/config.json outputname.tsv errorlog.txt False

To see only the query plans, set the last argument to True.
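
For readers who prefer to drive batch runs from Python, the sketch below builds one runExperiment.py invocation per file in a query folder, using only the -q, -c and -p options shown above. It is an alternative to runOntarioExp.sh under stated assumptions, not a replacement: the helper name, the choice to treat every regular file in the folder as a query, and the example paths are hypothetical, and writing the output and error log files is left out.

```python
import subprocess
from pathlib import Path


def build_commands(query_folder, config_path, plan_only=False):
    """Return one argument list per query file, in sorted file order."""
    commands = []
    for query_file in sorted(Path(query_folder).iterdir()):
        if query_file.is_file():
            commands.append([
                "./runExperiment.py",
                "-q", str(query_file),
                "-c", str(config_path),
                "-p", str(plan_only),  # "True" or "False", as in the examples above
            ])
    return commands


if __name__ == "__main__":
    # Hypothetical paths; run from the directory that contains runExperiment.py.
    for command in build_commands("path/to/queriesfolder", "path/to/config.json"):
        subprocess.run(command, check=False)
```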

Creating Docker image

    docker build -t ontario:0.5 .

Alternatively, use the pre-built image kemele/ontario:0.5.

Using Ontario SPARQL endpoint

Currently, running Ontario as a SPARQL endpoint is supported in the Docker image kemele/ontario:0.5. The following Python snippet sends a query to the endpoint:

import requests

# Example query: list up to five distinct classes.
query = 'SELECT DISTINCT ?Concept WHERE{?s a ?Concept} LIMIT 5'
# Passing a dict lets requests URL-encode the query string itself.
resp = requests.get('http://localhost:5001/sparql', params={'query': query})
if resp.status_code == 200:
    result = resp.json()  # parse the JSON response body
    print(result)

Output:

{'execTime': 0.15656578540802002,
 'firstResult': 0.15205996036529541,
 'totalRows': 5,
 'vars': ['Concept'],
 'result': [
            {'Concept': {'type': 'uri', 'value': 'http://bio2rdf.org/ns/kegg#Drug'}},
            {'Concept': {'type': 'uri', 'value': 'http://bio2rdf.org/ns/kegg#Enzyme'}},
            {'Concept': {'type': 'uri', 'value': 'http://bio2rdf.org/ns/kegg#Compound'}},
            {'Concept': {'type': 'uri', 'value': 'http://bio2rdf.org/ns/kegg#Reaction'}},
            {'Concept': {'type': 'uri', 'value': 'http://www4.wiwiss.fu-berlin.de/drugbank/resource/drugbank/drug_interactions'}}
               ]
}
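
Given a response shaped like the output above, the bound values can be pulled out with a few lines of Python. This is a minimal sketch: the helper name is hypothetical, SAMPLE is a shortened copy of the output shown above, and a variable that a row does not bind is reported as None.

```python
# Shortened copy of the example output: only "vars" and two result rows.
SAMPLE = {
    "vars": ["Concept"],
    "result": [
        {"Concept": {"type": "uri", "value": "http://bio2rdf.org/ns/kegg#Drug"}},
        {"Concept": {"type": "uri", "value": "http://bio2rdf.org/ns/kegg#Enzyme"}},
    ],
}


def rows_as_tuples(response):
    """Return one tuple of plain values per result row, ordered like 'vars'."""
    variables = response["vars"]
    return [
        tuple(row.get(var, {}).get("value") for var in variables)
        for row in response["result"]
    ]


if __name__ == "__main__":
    for values in rows_as_tuples(SAMPLE):
        print(values)
```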

Publication

Kemele M. Endris, Philipp D. Rohde, Maria-Esther Vidal, and Sören Auer. "Ontario: Federated Query Processing against a Semantic Data Lake." DEXA 2019 - Database and Expert Systems Applications. Lecture Notes in Computer Science. Springer, Cham (2019).

License

This work is licensed under the GNU General Public License v2.0 (GPL-2.0).