Benchmarking KnetMiner Data, Neo4j vs Virtuoso

This module is used to perform tests with KnetMiner data, encoded either as RDF or Neo4j, by means of the rdf2pg tool.

An older version of this work was presented in our paper at SWAT4HCLS 2018. A presentation from the workshop is also available.

Contents
Test Results
- Figure 1: Loading Performance
- Figure 2: Query Performance
Test Conditions
Test Approach
Test Data Sets
Queries
- Figure 3: Graph Pattern Used with Test Queries
- Query List

Test Results

Results are summarised in the following figures. We recommend reading the rest of this document first, since it explains how the tests were run. See this Excel file for details.

Click on the images to see a bigger version.

Figure 1: Loading Performance

Figure 2: Query Performance

A detailed table is here.

Test Conditions

Hardware: MacBook Pro, 2.9 GHz Intel Core i7, 16GB RAM
Both the servers and the client (this package) run on the same computer, thus network latency is minimised
Only one server at a time is on while running a test of a given type (Neo4j/Virtuoso)

Test Approach

For each database (Neo4j/Virtuoso) a number of query types are tested (see below). For each query type a Cypher and a SPARQL version were written, aiming to keep the same or very similar semantics, as well as similar graph patterns and other language constructs known to affect database performance (e.g., filters, ORDER BY clauses).
Queries were written considering:
- the typical query needs for our data
- the aim to test particular query language operations and features
- examples taken from existing benchmarks (e.g., nestAg)
- Certain queries are instantiated with parameters at each execution (e.g., joinFilter retrieves proteins by name, so the name is a required parameter). For those cases, files with predefined parameter values were prepared (taking values from the database). Every time the query has to be executed, a value is picked randomly and injected into the query.
- Queries were written by first defining a data retrieval goal and then writing an implementation in both Cypher and SPARQL that matches the goal as closely as possible. Moreover, the respective language constructs were chosen to replicate similar graph pattern structures and similar database engine challenges (e.g., 2union1Nest (results/src/main/assembly/resources/cypher/0130_2union1Nest.cypher) could be written by replicating branches, rather than unifying them with multiple WITH clauses, but the result would be significantly different from the corresponding SPARQL and would not contain the nested unions that the query is supposed to test).
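The parameter injection described above can be sketched as follows. This is a minimal illustration in Python, not the module's actual code: the `$name` placeholder syntax, the one-value-per-line file layout, the helper names `load_param_values` and `instantiate_query`, and the query text itself are all assumptions made for the example.

```python
import random
from string import Template


def load_param_values(text):
    """Parse predefined parameter values, assuming one value per line."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def instantiate_query(query_template, values, rng=random):
    """Pick a random value and inject it into the query template.

    Assumes the template marks its parameter as $name (hypothetical syntax).
    """
    value = rng.choice(values)
    return Template(query_template).substitute(name=value)


# Hypothetical query text and parameter values, for illustration only.
QUERY = 'MATCH (p:Protein) WHERE p.prefName = "$name" RETURN p LIMIT 10'
VALUES = load_param_values("P53\nBRCA1\nRAD51\n")

# A seeded generator makes the example repeatable; a real run would not seed it.
query = instantiate_query(QUERY, VALUES, random.Random(42))
print(query)
```

Passing the random generator as an argument keeps the helper easy to test, while the default falls back to the shared `random` module.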
There are two test types: Cypher and SPARQL, which are run separately. Every test type is based on the following procedure:
1. The database server is started.
2. A number of predefined iterations (usually a few thousand) are run. For each iteration:
   - a query is randomly selected from the set for the test type (Cypher or SPARQL)
   - if it's a parametric query, a random parameter is chosen (see above)
   - the query is run and its execution time is tracked. We track the time from when the query string is sent to the server to when the first result is fetched. This includes network latency, which we want to include in the evaluation for two reasons: (a) it is a small overhead, comparable between the two database engines (comparing them is our primary goal); (b) in real use cases it is a relevant part of the response time.
3. At the end of all the iterations, the times of each query are averaged and the results are reported.
- Repeating the queries is done to get an average behaviour; running them in random order avoids biases such as the exploitation of caches. We are not testing parallel performance (i.e., many clients running queries simultaneously), since we're interested in comparing speeds with respect to the query types.
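The procedure above can be sketched roughly as below. This is a hedged illustration in Python rather than the code used by this module: `run_benchmark`, `time_to_first_result` and `fake_run_query` are hypothetical names, the client is a stand-in that returns fake rows, and the query texts are placeholders.

```python
import random
import time
from collections import defaultdict
from statistics import mean


def time_to_first_result(run_query, query_text):
    """Return seconds from sending the query to fetching its first result."""
    start = time.perf_counter()
    results = run_query(query_text)  # sends the query, returns an iterable
    next(iter(results), None)        # fetch the first result only
    return time.perf_counter() - start


def run_benchmark(queries, run_query, iterations, rng=random):
    """Run randomly chosen queries and return the mean time per query name.

    `queries` maps a query name to a zero-argument function returning the
    query text, so parametric queries can draw a new value at each call.
    """
    timings = defaultdict(list)
    names = list(queries)
    for _ in range(iterations):
        name = rng.choice(names)  # random order limits cache-related bias
        query_text = queries[name]()
        timings[name].append(time_to_first_result(run_query, query_text))
    return {name: mean(times) for name, times in timings.items()}


def fake_run_query(query_text):
    """Stand-in for a real database client (assumption): yields one fake row."""
    return iter([("row", query_text)])


# Placeholder texts, not the real benchmark queries.
QUERIES = {
    "cnt": lambda: "count instances",
    "join": lambda: "simple join",
}

averages = run_benchmark(QUERIES, fake_run_query, 200, random.Random(0))
for name, seconds in sorted(averages.items()):
    print(f"{name}: {seconds * 1000:.4f} ms")
```

Timing stops after the first result is fetched, matching the measurement described above; replacing `fake_run_query` with a function that calls a real Neo4j or Virtuoso client would make the measured time include the actual round trip.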

Test Data Sets

Each test type is run against database instances containing three different datasets:

BioPax: a small dataset with BioPAX and GeneOntology data. RDF dump. Neo4j dump
Arabidopsis: the KnetMiner data set about Arabidopsis, medium size. RDF dump. Neo4j dump
Wheat: the KnetMiner data set about wheat, the largest of the three. RDF dump. Neo4j dump

Queries

All the queries listed below, and used in the tests, are based on the BioKNO ontology schematisation. Several of them are based on the graph pattern in Figure 3, which models biological pathway relations in BioKNO.

Figure 3: Graph Pattern Used with Test Queries

Query List

cnt: Counts instances, Cypher, SPARQL
cntType: Instances of a given type, Cypher, SPARQL
cntRel: Count relations, Cypher, SPARQL
cntRelType: Count relations of a given type, Cypher, SPARQL
sel: Select entity and properties, Cypher, SPARQL
join: Simple Join, Cypher, SPARQL
joinRel: Join matching relation, Cypher, SPARQL
joinFilter: Simple join + attribute filter, Cypher, SPARQL
joinRe: Simple join + regex search, Cypher, SPARQL
joinReif: Join through relation property, Cypher, SPARQL
varPathC: Variable path query (max len), Cypher, SPARQL
varPath: Variable path query (unbound len and top restricted), Cypher, SPARQL
2union: 2 unions, no nesting, Cypher, SPARQL
2union1Nest: 2 unions, 1 nesting, Cypher, SPARQL
2union1Nest+: 2 unions, 1 nesting (with Cypher CALL), Cypher, SPARQL
pway: Complex union of paths over pathways, Cypher, SPARQL
grp: Group by, Cypher, SPARQL
grpAg: Group by + 2 aggregation functions, Cypher, SPARQL
mulGrpAg: Multiple subqueries having aggregations, Cypher, SPARQL
nestAg: Nested and outer aggregations (see Q6 from the Berlin benchmark), Cypher, SPARQL
exist: Not exists, Cypher, SPARQL
existAg: Not exists + aggregation, Cypher, SPARQL
