In this repository you can find the data files and queries used in the benchmarking section for WDBench.
The data used in this benchmark is based on the Wikidata Truthy dump from 2021-06-23. We cleaned the data by removing all triples whose predicate is not a direct property (i.e., `http://www.wikidata.org/prop/direct/P*`). The data is available to download from Figshare.
The script to generate this data from the original dump is in our source folder.
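For illustration, a minimal sketch of this cleaning step is shown below. It is not the script from our source folder; the file names are placeholders, and it assumes the dump is in N-Triples format (one triple per line):

```python
# Minimal sketch of the cleaning step: keep only triples whose predicate is a
# Wikidata direct property. File names are placeholders, and the input is
# assumed to be N-Triples (one triple per line), as in the truthy dump.
import re

DIRECT = re.compile(r'<http://www\.wikidata\.org/prop/direct/P\d+>')

with open('latest-truthy.nt', encoding='utf-8') as src, \
     open('truthy_direct_properties.nt', 'w', encoding='utf-8') as dst:
    for line in src:
        parts = line.split(None, 2)            # subject, predicate, rest
        if len(parts) == 3 and DIRECT.fullmatch(parts[1]):
            dst.write(line)
```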
Apache Jena requires a Java JDK (we used OpenJDK 11; other versions might work as well).
The installation may be different depending on your Linux distribution. For Debian/Ubuntu based distributions:
sudo apt update
sudo apt install openjdk-11-jdk
You can download Apache Jena from their website. The file you need to download will look like `apache-jena-4.X.Y.tar.gz`; we used version 4.1.0, but this should also work for newer versions.
tar -xf apache-jena-4.*.*.tar.gz
cd apache-jena-4.*.*/
bin/tdbloader2 --loc=[path_of_new_db_folder] [path_of_nt_file]
This step is necessary only if you want to use the Leapfrog Jena implementation; otherwise you can skip it.
Edit the text file `bin/tdbloader2index` and search for the lines:
generate_index "$K1 $K2 $K3" "$DATA_TRIPLES" SPO
generate_index "$K2 $K3 $K1" "$DATA_TRIPLES" POS
generate_index "$K3 $K1 $K2" "$DATA_TRIPLES" OSP
After those lines add:
generate_index "$K1 $K3 $K2" "$DATA_TRIPLES" SOP
generate_index "$K2 $K1 $K3" "$DATA_TRIPLES" PSO
generate_index "$K3 $K2 $K1" "$DATA_TRIPLES" OPS
Now you can execute the bulk import in the same way as before:
bin/tdbloader2 --loc=[path_of_new_db_folder] [path_of_nt_file]
In order to run the benchmark for Leapfrog Jena you also need to use a custom `fuseki-server.jar`:
- Install openjdk-8 and Maven, and set `JAVA_HOME` to use Java 8.
- Clone and build the modified Fuseki:
git clone https://github.com/cirojas/jena-leapfrog
cd jena-leapfrog
mvn clean install -Drat.numUnapprovedLicenses=100 -Darguments="-Dmaven.javadoc.skip=true" -DskipTests
- Use `jena-fuseki2/apache-jena-fuseki/target/apache-jena-fuseki-3.9.0.tar.gz` instead of the one you download normally.
Virtuoso has a problem with geo-datatypes, so we generated a new .nt file to prevent those literals from being parsed as geo-datatypes:
sed 's/#wktLiteral/#wktliteral/g' [path_of_nt_file] > [virtuoso_nt_file]
You can download Virtuoso from their GitHub. We used Virtuoso Open Source Edition, version 7.2.6.
- Download:
wget https://github.com/openlink/virtuoso-opensource/releases/download/v7.2.6.1/virtuoso-opensource.x86_64-generic_glibc25-linux-gnu.tar.gz
- Extract:
tar -xf virtuoso-opensource.x86_64-generic_glibc25-linux-gnu.tar.gz
- Enter the folder:
cd virtuoso-opensource
- We start from their example configuration file:
cp database/virtuoso.ini.sample wikidata.ini
- Edit `wikidata.ini` with a text editor (when you edit a path, we recommend using the absolute path); an example excerpt with these settings is shown after this list:
  - Replace every `../database/` with the path of the database folder you want to create.
  - Add the path of the folder where you have `[virtuoso_nt_file]` and the path of the database folder you want to create to `DirsAllowed`.
  - Change `VADInstallDir` to the path of `virtuoso-opensource/vad`.
  - Set `NumberOfBuffers`. For loading the data we used `7200000`; to run the experiments we used `5450000`.
  - Set `MaxDirtyBuffers`. For loading the data we used `5400000`; to run the experiments we used `4000000`.
  - Revise `ResultSetMaxRows`; our experiments set this to `1000000`.
  - Revise `MaxQueryCostEstimationTime`; our experiments commented this line out with ';', removing the limit.
  - Revise `MaxQueryExecutionTime`; our experiments used `60` for 1-minute timeouts.
  - Add these lines at the end of the file:
    [Flags]
    tn_max_memory = 2755359740
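For reference, after these edits the relevant parts of `wikidata.ini` should look roughly like the excerpt below. The paths are placeholders, the section layout follows Virtuoso's sample file (double-check where each setting lives in your copy), and the numbers are the values we used when running the experiments:

```ini
; excerpt only -- all other settings stay as in virtuoso.ini.sample
[Parameters]
DirsAllowed           = ., /absolute/path/to/nt_folder, /absolute/path/to/database_folder
VADInstallDir         = /absolute/path/to/virtuoso-opensource/vad/
NumberOfBuffers       = 5450000     ; 7200000 while loading the data
MaxDirtyBuffers       = 4000000     ; 5400000 while loading the data

[SPARQL]
ResultSetMaxRows      = 1000000
; MaxQueryCostEstimationTime = 400  ; commented out to remove the limit
MaxQueryExecutionTime = 60          ; 1-minute timeout

[Flags]
tn_max_memory = 2755359740
```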
- Start the server:
bin/virtuoso-t -c wikidata.ini +foreground
- This process won't end until you interrupt it (Ctrl+C). Let this execute until the import ends. Run the next command in another terminal.
- Open the Virtuoso console:
bin/isql localhost:1111
and inside the `isql` console run:
ld_dir('[path_to_virtuoso_folder]', '[virtuoso_nt_file]', 'http://wikidata.org/');
rdf_loader_run();
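Loading can take a while. Inside the same `isql` console you can check progress and, once `rdf_loader_run()` returns, persist the loaded data with a checkpoint; these are standard Virtuoso bulk-loader commands rather than anything specific to this benchmark:

```sql
-- optional: monitor the bulk load and make it durable
select ll_file, ll_state from DB.DBA.load_list;   -- ll_state = 2 means the file has been loaded
checkpoint;                                       -- persist the loaded data to disk
```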
You'll need the following prerequisites installed:
- Java JDK (with `$JAVA_HOME` defined and `$JAVA_HOME/bin` on `$PATH`)
- Maven
- Git
The installation may be different depending on your Linux distribution. For Debian/Ubuntu based distributions:
sudo apt update
sudo apt install openjdk-11-jdk maven git
Blazegraph can't load big files in a reasonable time, so we need to split the .nt file into smaller files (1 million triples each):
mkdir splitted_nt
cd splitted_nt
split -l 1000000 -a 4 -d --additional-suffix=.nt [path_to_nt]
cd ..
git clone --recurse-submodules https://gerrit.wikimedia.org/r/wikidata/query/rdf wikidata-query-rdf
cd wikidata-query-rdf
mvn package
cd dist/target
tar xvzf service-*-dist.tar.gz
cd service-*/
mkdir logs
- Edit the script file `runBlazegraph.sh` with any text editor:
  - Configure main memory here: `HEAP_SIZE=${HEAP_SIZE:-"64g"}` (you may use another value depending on how much RAM your machine has).
  - Set the log folder `LOG_DIR=${LOG_DIR:-"/path/to/logs"}`, replacing `/path/to/logs` with the absolute path of the `logs` dir you created in the previous step.
  - Add `-Dorg.wikidata.query.rdf.tool.rdf.RdfRepository.timeout=60` to the `exec java` command to specify the timeout (value is in seconds).
  - Also change `-Dcom.bigdata.rdf.sparql.ast.QueryHints.analyticMaxMemoryPerQuery=0`, which removes per-query memory limits.
- Start the server:
./runBlazegraph.sh
- This process won't end until you interrupt it (Ctrl+C). Let this execute until the import ends. Run the next command in another terminal.
- Start the import:
./loadRestAPI.sh -n wdq -d [path_of_splitted_nt_folder]
- Download Neo4J community edition from their website https://neo4j.com/download-center/#community. We used version 4.3.5, but these instructions might work for newer versions.
- Extract the downloaded file
tar -xf neo4j-community-4.*.*-unix.tar.gz
- Enter the folder:
cd neo4j-community-4.*.*/
- Set the variable `$NEO4J_HOME` pointing to the Neo4J folder (using `export` and adding it to your .bashrc/.zshrc).
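For example (adjust the path to wherever you extracted Neo4J):
export NEO4J_HOME=/absolute/path/to/neo4j-community-4.3.5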
Edit the text file `conf/neo4j.conf`:
- Set `dbms.default_database=wikidata`
- Uncomment the line `dbms.security.auth_enabled=false`
- Add the line `dbms.transaction.timeout=10m`
Use the script `nt_to_neo4j.py` to generate the .csv files `entities.csv`, `literals.csv` and `edges.csv`.
Execute the data import (the command assumes that the .csv files are in the `wikidata_csv` folder):
bin/neo4j-admin import --database wikidata \
--nodes=Entity=wikidata_csv/entities.csv \
--nodes wikidata_csv/literals.csv \
--relationships wikidata_csv/edges.csv \
--delimiter "," --array-delimiter ";" --skip-bad-relationships true
Now we have to create the index for entities:
- Start the server:
bin/neo4j console
- This process won't end until you interrupt it (Ctrl+C). Let this execute until the index creation is finished. Run the next command in another terminal.
- Open the Cypher console:
bin/cypher-shell
and inside the console run the command:
CREATE INDEX ON :Entity(id);
- Even though the above command returns immediately, you have to wait until the index is built before interrupting the server. You can see the status of the index with the command
SHOW INDEXES;
In this benchmark we have 5 sets of queries:
- Basic Graph Patterns (BGPs):
  - Single BGPs: 280 queries
  - Multiple BGPs: 681 queries
- Optionals: 498 queries
- Property Paths: 660 queries
- C2RPQs: 539 queries
We provide the SPARQL queries in our queries folder. We also provide the equivalent Cypher property paths (this set has fewer queries because some property paths cannot be expressed in Cypher).
Single BGPs, Multiple BGPs and Property Paths are based on real queries extracted from the Wikidata SPARQL query log. This log contains millions of queries, but many of them are trivial to evaluate. We thus decided to generate our benchmark from more challenging cases, i.e., a smaller log of queries that timed out on the Wikidata public endpoint. From these queries we extracted their BGPs and property paths, removing duplicates (modulo isomorphism on query variables).
Then we filtered with the same criteria that we applied to the data, removing all queries with predicates that are not a direct property (`http://www.wikidata.org/prop/direct/P*`). Next, for property paths we removed queries that have both subject and object as variables, and for BGPs we removed queries having a triple pattern in which subject, predicate and object are all variables.
Finally, we distinguish BGP queries consisting of a single triple pattern (Single BGPs) from those containing more than one triple pattern (Multiple BGPs).
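As an illustration of these criteria (not the actual extraction pipeline), the sketch below applies them to triple patterns represented as (subject, predicate, object) strings, where variables start with `?`:

```python
# Illustration of the filtering rules on triple patterns written as
# (subject, predicate, object) strings; variables start with '?'.
import re

DIRECT = re.compile(r'http://www\.wikidata\.org/prop/direct/P\d+')

def is_var(term):
    return term.startswith('?')

def keep_bgp(patterns):
    """Keep a BGP only if every constant predicate is a direct property and no
    triple pattern has subject, predicate and object all as variables."""
    for s, p, o in patterns:
        if not is_var(p) and not DIRECT.fullmatch(p):
            return False
        if is_var(s) and is_var(p) and is_var(o):
            return False
    return True

def keep_path(subject, obj):
    """Keep a property-path query only if subject and object are not both variables."""
    return not (is_var(subject) and is_var(obj))

def classify_bgp(patterns):
    return 'single' if len(patterns) == 1 else 'multiple'

# Example: this query is kept and classified as a Multiple BGP.
bgp = [('?x', 'http://www.wikidata.org/prop/direct/P31', 'http://www.wikidata.org/entity/Q5'),
       ('?x', 'http://www.wikidata.org/prop/direct/P106', '?occupation')]
print(keep_bgp(bgp), classify_bgp(bgp))   # True multiple
```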
Here we provide a description of the scripts we used for the execution of the queries.
Our scripts execute a list of queries for a given engine, one at a time, and record the time and number of results for each query in a CSV file.
Every time you want to run a benchmark script you must first clear the system cache. To do this, run as root:
sync; echo 3 > /proc/sys/vm/drop_caches
Then you have 2 scripts for executing the benchmarks, one for SPARQL engines and another for NEO4J. They are placed in the `Execution` folder.
Each script has a parameters section near the beginning of the file (e.g. database paths, output folder); make sure to edit the script to set them properly.
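As a rough sketch of what these scripts do (this is not the code in the `Execution` folder; the endpoint URL and file names are placeholders, it assumes one complete SPARQL SELECT query per line in the query file, and each engine's HTTP interface differs in the details):

```python
# Simplified sketch of the measurement loop: run each query against a SPARQL
# HTTP endpoint, record the execution time and the number of results in a CSV.
# The endpoint URL and file names are placeholders; the query file is assumed
# to contain one complete SPARQL SELECT query per line.
import csv
import time
import requests

ENDPOINT = 'http://localhost:3030/wikidata/sparql'   # e.g. a local Fuseki endpoint
LIMIT = 100000

with open('queries.txt') as queries, open('results.csv', 'w', newline='') as out:
    writer = csv.writer(out)
    writer.writerow(['query', 'time_seconds', 'results', 'error'])
    for number, query in enumerate(queries, start=1):
        start = time.time()
        try:
            response = requests.get(
                ENDPOINT,
                params={'query': f'{query.strip()} LIMIT {LIMIT}'},
                headers={'Accept': 'application/sparql-results+json'},
                timeout=60)
            results = len(response.json()['results']['bindings'])
            error = ''
        except Exception as exc:              # timeout, HTTP error, malformed response, ...
            results, error = 0, type(exc).__name__
        writer.writerow([number, round(time.time() - start, 3), results, error])
```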
To execute the benchmark for a SPARQL engine you need to pass 4 parameters:
- the engine name (`JENA`, `BLAZEGRAPH` or `VIRTUOSO`)
- the absolute path to the query file
- the limit for the queries
- any name you want to give as a prefix for the output file
E.g.:
python Execution/benchmark.py JENA ~/WDBench/Queries/paths.txt 100000 paths
To execute the benchmark for NEO4J you need to manually start the server after clearing the cache. Then you can execute the script passing 3 parameters:
- the path (absolute or relative) to the query file
- the limit for the queries
- any name you want to give as a prefix for the output file
E.g.:
python Execution/cypher_benchmark.py Cypher/cypher_opts.txt 100000 opts