/graphdb-benchmarks

Application to benchmark Neo4j/Virtuoso querying

Benchmarking KnetMiner Data, Neo4j vs Virtuoso

This module is used to perform tests with KnetMiner data, encoded either as RDF or Neo4j, by means of the rdf2pg tool.

An older version of this work was presented in our paper at SWAT4HCLS 2018. A presentation from the workshop is also available.

Test Results

Results are summarised in the following figures. We recommend reading this document first. See this Excel file for details.

Click on the images to see a bigger version.

Figure 1: Loading Performance

Figure 2: Query Performance

A detailed table is here.

Test Conditions

  • Hardware: MacBook Pro, 2.9 GHz Intel Core i7, 16GB RAM
  • Both the servers and the client (this package) run on the same computer, so network latency is minimised
  • Only one server at a time is on while running a test of a given type (Neo4j/Virtuoso)

Test Approach

  • For each database (Neo4j/Virtuoso), a number of query types are tested (see below). For each query type, a Cypher and a SPARQL version were written, aiming to keep the same or very similar semantics, as well as similar graph patterns and other language constructs known to affect database performance (e.g., filters, ORDER BY clauses).

  • Queries were written considering:

    • the typical query needs for our data
    • the aim to test particular query language operations and features
    • examples taken from existing benchmarks (e.g., nestAg)
    • Certain queries are instantiated with parameters at each execution (e.g., joinFilter retrieves proteins by name, and the name is a required parameter). For those cases, files with predefined parameter values were prepared (taking values from the database). Every time the query is executed, a value is picked randomly and injected into the query.
    • Queries were written by first defining a data retrieval goal and then implementing it in both Cypher and SPARQL, matching the goal as closely as possible. Moreover, the language constructs used in each case were chosen to replicate similar graph pattern structures and similar database engine challenges (e.g., 2union1Nest (results/src/main/assembly/resources/cypher/0130_2union1Nest.cypher) could be written by replicating branches, rather than unifying them with multiple WITH clauses, but the result would differ significantly from the corresponding SPARQL and would not contain the nested unions that the query is supposed to test).
  • There are two test types, Cypher and SPARQL, which are run separately. Every test type follows this procedure:

    1. The database server is started.
    2. A predefined number of iterations (usually a few thousand) is run. For each iteration:
       • a query is randomly selected from the relevant set (Cypher or SPARQL)
       • if it is a parametric query, a random parameter value is chosen (see above)
       • the query is run and its execution time is tracked. We measure the time from when the query string is sent to the server to when the first result is fetched. This includes network latency, which we want to include in the evaluation for two reasons: (a) it is a small overhead, comparable between the two database engines (comparing the two is our primary goal); (b) it is a relevant time in real use cases.
    3. At the end of all the iterations, the times of each query are averaged and the results are reported.
    • Repeating the queries gives an average behaviour, and running them in random order avoids biases such as the exploitation of caches. We do not test parallel performance (i.e., many clients running queries simultaneously), since we are interested in comparing speeds with respect to the query types.
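
The procedure above can be illustrated with a minimal Java sketch. This is not the code used in this package: the class BenchmarkLoopSketch, the QueryDef holder, the {{value}} placeholder and the server function are all hypothetical names introduced here for illustration. The server is modelled as a function from a query string to an iterator of results, so the sketch is independent of any particular Neo4j or Virtuoso client library.

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;
import java.util.function.Function;

/** Hypothetical sketch of the test loop described above, not the package's actual code. */
class BenchmarkLoopSketch {

    /** A query template plus its predefined parameter values (empty if not parametric). */
    static final class QueryDef {
        final String name;
        final String template;
        final List<String> paramValues;

        QueryDef(String name, String template, List<String> paramValues) {
            this.name = name;
            this.template = template;
            this.paramValues = paramValues;
        }
    }

    /** Injects a randomly picked value into the (assumed) {{value}} placeholder. */
    static String instantiate(QueryDef q, Random rnd) {
        if (q.paramValues.isEmpty()) {
            return q.template;
        }
        String value = q.paramValues.get(rnd.nextInt(q.paramValues.size()));
        return q.template.replace("{{value}}", value);
    }

    /** Nanoseconds from sending the query string to fetching the first result. */
    static long timeToFirstResult(String query, Function<String, Iterator<?>> server) {
        long start = System.nanoTime();
        Iterator<?> results = server.apply(query);
        if (results.hasNext()) {
            results.next();
        }
        return System.nanoTime() - start;
    }

    /** Runs the iterations and returns the average time (ns) per query name. */
    static Map<String, Double> run(List<QueryDef> queries, int iterations,
                                   Function<String, Iterator<?>> server, Random rnd) {
        Map<String, long[]> totals = new LinkedHashMap<>(); // name -> {sum, count}
        for (int i = 0; i < iterations; i++) {
            // Random selection, so that no query benefits from a fixed execution order
            QueryDef q = queries.get(rnd.nextInt(queries.size()));
            String queryString = instantiate(q, rnd);
            long elapsed = timeToFirstResult(queryString, server);
            long[] t = totals.computeIfAbsent(q.name, k -> new long[2]);
            t[0] += elapsed;
            t[1]++;
        }
        Map<String, Double> averages = new LinkedHashMap<>();
        totals.forEach((name, t) -> averages.put(name, (double) t[0] / t[1]));
        return averages;
    }
}
```

In a real run, the server function would wrap a Cypher or SPARQL client call, and run() would be invoked once per test type with the corresponding set of queries.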

Test Data Sets

Each test type is run against database instances containing three different datasets:

Queries

All the queries listed below, and used in the tests, are based on the BioKNO ontology schematisation. Several of them are based on the graph pattern in Figure 3, which models biological pathway relations in BioKNO.

Figure 3: Graph Pattern Used with Test Queries

Query List

  1. cnt: Count instances, Cypher, SPARQL
  2. cntType: Count instances of a given type, Cypher, SPARQL
  3. cntRel: Count relations, Cypher, SPARQL
  4. cntRelType: Count relations of a given type, Cypher, SPARQL
  5. sel: Select entity and properties, Cypher, SPARQL
  6. join: Simple Join, Cypher, SPARQL
  7. joinRel: Join matching relation, Cypher, SPARQL
  8. joinFilter: Simple join + attribute filter, Cypher, SPARQL
  9. joinRe: Simple join + regex search, Cypher, SPARQL
  10. joinReif: Join through relation property, Cypher, SPARQL
  11. varPathC: Variable path query (max len), Cypher, SPARQL
  12. varPath: Variable path query (unbound len and top restricted), Cypher, SPARQL
  13. 2union: 2 unions, no nesting, Cypher, SPARQL
  14. 2union1Nest: 2 unions, 1 nesting, Cypher, SPARQL
  15. 2union1Nest+: 2 unions, 1 nesting (with Cypher CALL), Cypher, SPARQL
  16. pway: Complex union of paths over pathways, Cypher, SPARQL
  17. grp: Group by, Cypher, SPARQL
  18. grpAg: Group by + 2 aggregation functions, Cypher, SPARQL
  19. mulGrpAg: Multiple subqueries having aggregations, Cypher, SPARQL
  20. nestAg: Nested and outer aggregations (see Q6 from the Berlin benchmark), Cypher, SPARQL
  21. exist: Not exists, Cypher, SPARQL
  22. existAg: Not exists + aggregation, Cypher, SPARQL