Merck/Halyard

Halyard benchmarking -- how to improve?


Hi Adam @asotona!

I have benchmarked Halyard on a single-node setup (i7-3770 3.4GHz, 32GB RAM, regular HDD) running HDFS + YARN + HBase + Halyard. The queries were sent through the rdf4j-server SPARQL endpoint, e.g.:

wget -O - "http://halyard/rdf4j-server/repositories/benchmark50?query=select%20%2A%20%7B%3Fs%20%3Fp%20%3Fo%7D%20limit%2010"
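For anyone reproducing this, here is a minimal sketch of how the URL in the wget call above can be built from a plain SPARQL string. It only does the percent-encoding and sends nothing. The function name `build_query_url` is hypothetical, and the host name `halyard` and repository `benchmark50` are simply copied from the example above.

```python
from urllib.parse import quote

# Endpoint copied from the wget example; host and repository name are
# specific to this setup and will differ elsewhere.
ENDPOINT = "http://halyard/rdf4j-server/repositories/benchmark50"


def build_query_url(endpoint: str, sparql: str) -> str:
    """Return a GET URL with the SPARQL text percent-encoded in `query`.

    safe="" makes quote() encode every reserved character, so spaces
    become %20 and characters such as * { } ? are escaped as well.
    """
    return f"{endpoint}?query={quote(sparql, safe='')}"


if __name__ == "__main__":
    # Prints the same URL that is passed to wget above.
    print(build_query_url(ENDPOINT, "select * {?s ?p ?o} limit 10"))
```

The printed URL can be passed to wget or any other HTTP client as-is.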

I used the FEASIBLE [1] benchmark queries together with IGUANA [2]. The benchmarking configuration is available in the halyard docker repository [3] (iguana-config.tar.bz2).
As you can see from the benchmarking results, Halyard could answer only 6 queries for the smallest size, and 0 queries for the larger sizes (50 and 100).

From preliminary discussions: it is possible to query Halyard through its Java interface, and that should improve performance. Is there an example of how to do that?

[1] http://aksw.org/Projects/FEASIBLE.html
[2] http://aksw.org/Projects/IGUANA.html
[3] https://github.com/earthquakesan/docker-halyard

Update:

I did not add the benchmarking results to GitHub; they are here: https://www.dropbox.com/s/st5sz0hu7eoxj8l/benchmark_results.tar.bz2?dl=0

I'll take a look at it. There may be many configuration-related reasons why HBase does not perform well on a single-node cluster. The cause may also lie in Halyard's query evaluation or in the benchmarking queries themselves.

I have seen better performance when the 'Push' option is not enabled. There are probably issues with some queries (such as path queries) when that option is on. Have you tested with the Push option disabled?

@earthquakesan Hi Ivan, have you gotten it to work on a multi-node cluster as well? Thanks