This module is used to perform tests with KnetMiner data, encoded either as RDF or Neo4j, by means of the rdf2pg tool.
An older version of this work was presented in our paper at SWAT4HCLS 2018. A presentation from the workshop is also available.
Results are summarised in the following figures. We recommend reading this document first. See this Excel file for details.
Click on the images to see a bigger version.
A detailed table is here.
- Hardware: MacBook Pro, 2.9 GHz Intel Core i7, 16GB RAM
- Both the servers and the client (this package) are run on the same computer, so network latency is minimised
- Only one server at a time is on while running a test of a given type (Neo4j/Virtuoso)
- For each database (Neo4j/Virtuoso), a number of query types are tested (see below). For each query type, a Cypher and a SPARQL version were written, aiming to keep the same or very similar semantics, as well as similar graph patterns and other language constructs known to affect database performance (e.g., filters, `ORDER BY` clauses).
- Queries were written considering:
  - the typical query needs for our data
  - the aim of testing particular query language operations and features
  - examples from existing benchmarks (e.g., `nestAg`)
- Certain queries are instantiated with parameters at each execution (e.g., `joinFilter` retrieves proteins by name, and the name is a required parameter). For those cases, files with predefined parameter values were prepared (taking values from the database). Every time the query has to be executed, a value is picked randomly and injected into the query.
- Queries were written by first defining a data retrieval goal and then writing an implementation in both Cypher and SPARQL that matches the goal as closely as possible. Moreover, the respective language constructs were chosen to replicate similar graph pattern structures and similar database engine challenges (e.g., [2union1Nest](results/src/main/assembly/resources/cypher/0130_2union1Nest.cypher) could be written by replicating branches, rather than unifying them with multiple WITH clauses, but the result would be significantly different from the corresponding SPARQL and would not contain the nested unions that the query is supposed to test).
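The parameter injection described above can be illustrated with a minimal Python sketch. It is not the code used by this package: the function names, the `${name}` placeholder syntax, the query text and the sample values are all assumptions made for illustration.

```python
import random
from string import Template


def load_param_values(lines):
    """Return the non-empty values found in an iterable of text lines."""
    return [line.strip() for line in lines if line.strip()]


def instantiate(query_template, values, rng=random):
    """Pick one predefined value at random and inject it into the query text."""
    value = rng.choice(values)
    # Plain text substitution; a real runner would also need to escape quotes
    return Template(query_template).substitute(name=value)


# Hypothetical query text and parameter values, for illustration only
QUERY = 'MATCH (p:Protein) WHERE p.name = "${name}" RETURN p'
VALUES = load_param_values(["P53\n", "\n", "BRCA1\n"])

if __name__ == "__main__":
    print(instantiate(QUERY, VALUES))
```

Passing a seeded `random.Random` instance as `rng` makes the choice of values reproducible across runs, which may be useful when comparing the two engines on the same sequence of parameters.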
- There are two test types, Cypher and SPARQL, which are run separately. Every test type is based on the following procedure:
  - The database server is started
  - A number of predefined iterations (usually a few thousand) are run. For each iteration:
    - a query is randomly selected from the set for the current test type (Cypher or SPARQL)
    - if it is a parametric query, a random parameter is chosen (see above)
    - the query is run and the execution time is tracked. We track the time from when the query string is sent to the server to when the first result is fetched. This includes network latency, which we want to include in the evaluation for several reasons:
      1. network latency is a small overhead and comparable between the two database engines (our primary goal is to compare the two)
      2. in real use cases it is a relevant time
  - At the end of all the iterations, the times of each query are averaged and results are reported.
- Repeating the queries is done to get an average behaviour. Running them in random order avoids biases such as the exploitation of caches. We are not testing parallel performance (i.e., many clients running queries simultaneously), since we are interested in comparing speeds with respect to the query types.
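The procedure above can be sketched in Python as follows. This is a simplified illustration, not the package's actual implementation: `run_benchmark`, the `queries` mapping and the `run_first_result` callable are hypothetical names, and the callable is assumed to send the query to the server and return as soon as the first result has been fetched.

```python
import random
import time
from collections import defaultdict


def run_benchmark(queries, run_first_result, iterations=1000, rng=random):
    """Run randomly chosen queries and return the average time per query name.

    queries: dict mapping a query name to a zero-argument callable that
             returns the query string (with any parameter already injected).
    run_first_result: callable that sends the query string to the server and
             returns once the first result has been fetched (assumption).
    """
    timings = defaultdict(list)
    names = list(queries)
    for _ in range(iterations):
        name = rng.choice(names)          # random order across query types
        query = queries[name]()           # build the query text for this run
        start = time.perf_counter()
        run_first_result(query)           # timed span: send until first result
        timings[name].append(time.perf_counter() - start)
    # Average the recorded times of each query that was run at least once
    return {name: sum(ts) / len(ts) for name, ts in timings.items()}


if __name__ == "__main__":
    # Dummy executor standing in for a real Neo4j or Virtuoso client
    averages = run_benchmark(
        {"cnt": lambda: "QUERY 1", "sel": lambda: "QUERY 2"},
        run_first_result=lambda q: None,
        iterations=100,
    )
    for name, avg in sorted(averages.items()):
        print(f"{name}: {avg:.6f} s")
```

Building the query string before starting the timer keeps parameter selection out of the measurement, so only the round trip to the server is counted.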
Each test type is run against database instances containing three different datasets:
- BioPax: a small dataset with BioPAX and GeneOntology data. RDF dump. Neo4j dump
- Arabidopsis: the KnetMiner dataset about Arabidopsis, medium size. RDF dump. Neo4j dump
- Wheat: the KnetMiner dataset about wheat, the largest of the three. RDF dump. Neo4j dump
All the queries listed below, and used in the tests, are based on the BioKNO ontology schematisation. Several of them are based on the graph pattern in the figure, which models biological pathway relations in BioKNO.
- cnt: Count instances, Cypher, SPARQL
- cntType: Count instances of a given type, Cypher, SPARQL
- cntRel: Count relations, Cypher, SPARQL
- cntRelType: Count relations of a given type, Cypher, SPARQL
- sel: Select entity and properties, Cypher, SPARQL
- join: Simple Join, Cypher, SPARQL
- joinRel: Join matching relation, Cypher, SPARQL
- joinFilter: Simple join + attribute filter, Cypher, SPARQL
- joinRe: Simple join + regex search, Cypher, SPARQL
- joinReif: Join through relation property, Cypher, SPARQL
- varPathC: Variable path query (max len), Cypher, SPARQL
- varPath: Variable path query (unbound len and top restricted), Cypher, SPARQL
- 2union: 2 unions, no nesting, Cypher, SPARQL
- 2union1Nest: 2 unions, 1 nesting, Cypher, SPARQL
- 2union1Nest+: 2 unions, 1 nesting (with Cypher CALL), Cypher, SPARQL
- pway: Complex union of paths over pathways, Cypher, SPARQL
- grp: Group by, Cypher, SPARQL
- grpAg: Group by + 2 aggregation functions, Cypher, SPARQL
- mulGrpAg: Multiple subqueries having aggregations, Cypher, SPARQL
- nestAg: Nested and outer aggregations (see Q6 from the Berlin benchmark), Cypher, SPARQL
- exist: Not exists, Cypher, SPARQL
- existAg: Not exists + aggregation, Cypher, SPARQL