A REST API for querying SPARQL endpoints from the Apache Superset BI tool through the elasticsearch-dbapi package.
The Bio2RDF project benefited the life science community by providing a common data integration framework that let researchers easily combine and analyze data from different sources. Since then, many life science data providers have adopted the RDF publication approach, but after more than 15 years we cannot say that semantic web technology has been widely adopted by bioinformaticians.
The lack of data visualization tools hinders the development and usability of semantic web applications by making it difficult to understand and work with complex data structures and relationships. Since the Facet browser of Virtuoso, a web tool that explores an RDF triplestore by facet browsing, very few frameworks have made it easier to build a dashboard. Although commercial vendors such as Ontotext and Stardog offer dashboard tools, these cannot match the simplicity and popularity of Tableau and Power BI.
A mature business intelligence (BI) software tool capable of building a dashboard from SPARQL endpoints could provide significant benefits to the life science community by enabling efficient data analysis, integrating data sources, designing data visualization, and promoting collaboration and sharing.
The goal of this project is to modify an existing business intelligence (BI) tool so that it can consume SPARQL endpoints. This will make it possible to build dashboards from RDF resources in the fields of multi-omics analysis, genomics, transcriptomics, epigenomics, proteomics, protein structures, and biochemical data.
For this project, we have selected Superset as the dashboard technology. Superset is open-source Apache BI software that is compatible with many cloud-based SQL technologies. One such technology is Elasticsearch, for which a Python package (https://pypi.org/project/elasticsearch-dbapi/) lets Superset communicate with Elasticsearch through its existing SQL REST API.
Our strategy is to first create a REST API that mimics Elasticsearch behavior on top of a QLever SPARQL endpoint containing UniProt data (https://qlever.cs.uni-freiburg.de/uniprot). Superset uses SQLAlchemy (https://www.sqlalchemy.org/) to generate the SQL queries it submits to cloud datasource APIs. Our main task will be to translate the SQL statements generated by Superset's dashboard-building interface into equivalent SPARQL queries, which we will submit to the QLever SPARQL endpoint. An example of SQL to SPARQL conversion, performed by ChatGPT, can be found in the project's GitHub repository. In this project, we aim to create a SQL2SPARQL package in Python to perform the same translation. SQL and SPARQL query parsers already exist, so we only need to build the translator itself.
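To make the two halves of this strategy more concrete, here is a minimal sketch in Python. It is not the planned SQL2SPARQL package: it only handles a tiny SQL subset with a regular expression, and the `ex:` namespace, the "table is a class, column is a predicate" mapping, and the function names `sql_to_sparql` and `sparql_json_to_es_sql` are all assumptions made for illustration. The second function reshapes a standard SPARQL JSON result into the `columns`/`rows` layout that the Elasticsearch SQL REST API returns, with every column reported as `text` for simplicity.

```python
import re

# Hypothetical namespace: a real translator would take class and predicate
# IRIs from the target dataset's vocabulary instead.
PREFIX = "PREFIX ex: <http://example.org/>"

# Deliberately tiny SQL subset:
#   SELECT col[, col...] FROM table [WHERE col = 'value'] [LIMIT n]
_SIMPLE_SELECT = re.compile(
    r"^\s*SELECT\s+(?P<cols>\w+(?:\s*,\s*\w+)*)\s+FROM\s+(?P<table>\w+)"
    r"(?:\s+WHERE\s+(?P<wcol>\w+)\s*=\s*'(?P<wval>[^']*)')?"
    r"(?:\s+LIMIT\s+(?P<limit>\d+))?\s*;?\s*$",
    re.IGNORECASE,
)


def sql_to_sparql(sql: str) -> str:
    """Translate one simple SELECT statement into a SPARQL query string."""
    match = _SIMPLE_SELECT.match(sql)
    if match is None:
        raise ValueError(f"SQL not supported by this sketch: {sql!r}")
    table = match.group("table")
    cols = [c.strip() for c in match.group("cols").split(",")]
    where_col, where_val = match.group("wcol"), match.group("wval")
    # Every column that is selected or filtered on needs a triple pattern.
    bound = list(cols)
    if where_col and where_col not in bound:
        bound.append(where_col)
    lines = [PREFIX, "SELECT " + " ".join(f"?{c}" for c in cols), "WHERE {"]
    # Assumed mapping: the table becomes a class, each column a predicate.
    lines.append(f"  ?row a ex:{table.capitalize()} .")
    lines += [f"  ?row ex:{c} ?{c} ." for c in bound]
    if where_col:
        literal = where_val.replace("\\", "\\\\").replace('"', '\\"')
        lines.append(f'  FILTER(STR(?{where_col}) = "{literal}")')
    lines.append("}")
    if match.group("limit"):
        lines.append(f"LIMIT {match.group('limit')}")
    return "\n".join(lines)


def sparql_json_to_es_sql(result: dict) -> dict:
    """Reshape a SPARQL JSON result into an Elasticsearch-SQL-like response."""
    variables = result["head"]["vars"]
    rows = [
        [binding[v]["value"] if v in binding else None for v in variables]
        for binding in result["results"]["bindings"]
    ]
    # Simplification: all columns are declared as text, whatever the RDF datatype.
    columns = [{"name": v, "type": "text"} for v in variables]
    return {"columns": columns, "rows": rows}


if __name__ == "__main__":
    print(sql_to_sparql(
        "SELECT enzyme_name FROM enzyme WHERE enzyme_name = 'catalase' LIMIT 5"
    ))
```

Joins, aggregates, and nested subqueries like those in the example at the end of this README are out of reach for a regular expression; handling them is where an existing SQL parser would come in.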
Once the project is successful, we will create a sparql-dbapi Python package to enable Superset/SPARQL connectivity. This will allow us to build dashboards for life science users from existing SPARQL resources and promote further adoption of the RDF data model in the life sciences.
If you would like to participate in this project, open an issue with your comments and I will contact you.
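As a rough idea of what such a package would need to expose, the sketch below outlines a small part of the Python DB-API 2.0 (PEP 249) interface. The sparql-dbapi package does not exist yet, so nothing here reflects its actual design: `connect`, its two callable arguments `translate` and `run_sparql`, and the choice to leave `type_code` empty are assumptions for illustration. Superset talks to databases through SQLAlchemy, so a SQLAlchemy dialect would likely be needed on top of a driver like this.

```python
from typing import Callable, Optional

# Module-level attributes that PEP 249 asks a DB-API driver to expose.
apilevel = "2.0"
threadsafety = 1
paramstyle = "pyformat"


class Cursor:
    """Read-only cursor that turns SQL into SPARQL and buffers the rows."""

    def __init__(self, translate: Callable[[str], str],
                 run_sparql: Callable[[str], dict]):
        self._translate = translate    # hypothetical SQL -> SPARQL function
        self._run_sparql = run_sparql  # hypothetical endpoint client
        self._rows: list = []
        self.description: Optional[list] = None
        self.rowcount = -1

    def execute(self, operation: str, parameters=None) -> "Cursor":
        # Query parameters are ignored in this sketch.
        result = self._run_sparql(self._translate(operation))
        variables = result["head"]["vars"]
        # PEP 249 describes each column with a 7-item sequence; only the
        # name is filled in here, type_code is left as None.
        self.description = [
            (v, None, None, None, None, None, None) for v in variables
        ]
        self._rows = [
            tuple(b[v]["value"] if v in b else None for v in variables)
            for b in result["results"]["bindings"]
        ]
        self.rowcount = len(self._rows)
        return self

    def fetchone(self):
        return self._rows.pop(0) if self._rows else None

    def fetchall(self):
        rows, self._rows = self._rows, []
        return rows

    def close(self) -> None:
        self._rows = []


class Connection:
    def __init__(self, translate, run_sparql):
        self._translate = translate
        self._run_sparql = run_sparql

    def cursor(self) -> Cursor:
        return Cursor(self._translate, self._run_sparql)

    def commit(self) -> None:
        pass  # nothing to commit: a SPARQL query endpoint is read-only here

    def close(self) -> None:
        pass


def connect(translate, run_sparql) -> Connection:
    """Entry point in the style of PEP 249; the arguments are assumptions."""
    return Connection(translate, run_sparql)
```

Passing the translator and the endpoint client in as callables keeps the sketch free of network code and makes it easy to test with canned SPARQL JSON results.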
SELECT enzyme.enzyme_name, COUNT(DISTINCT reaction_protein.protein) AS count, publication.publication_title
FROM reaction_enzyme
JOIN enzyme ON reaction_enzyme.enzyme = enzyme.enzyme_uri
JOIN reaction_protein ON reaction_enzyme.reaction = reaction_protein.reaction
JOIN publication ON publication.publication_uri IN (
    SELECT DISTINCT publication.publication_uri
    FROM enzyme
    JOIN publication ON publication.publication_uri IN (
        SELECT DISTINCT mentions.publication_uri
        FROM mentions
        WHERE mentions.enzyme_uri = enzyme.enzyme_uri
    )
)
GROUP BY enzyme.enzyme_name, publication.publication_title
ORDER BY count DESC
The SQL query above is converted to the following SPARQL query:
PREFIX bio2rdf: <http://bio2rdf.org/>
SELECT ?enzyme (COUNT(DISTINCT ?protein) AS ?count) ?title
WHERE {
    ?reaction a bio2rdf:Reaction .
    ?reaction bio2rdf:reaction_enzyme ?enzyme .
    ?reaction bio2rdf:reaction_protein ?protein .
    {
        SELECT ?enzyme ?title
        WHERE {
            ?enzyme a bio2rdf:Enzyme .
            ?enzyme bio2rdf:enzyme_name ?name .
            ?publication a bio2rdf:Publication .
            ?publication bio2rdf:mentions ?enzyme .
            ?publication bio2rdf:publication_title ?title .
        }
    }
}
GROUP BY ?enzyme ?title
ORDER BY DESC(?count)