mchaloupka/EVI

SQL query performance

tpluscode opened this issue · 4 comments

Hi

I've just tried running some more complex queries and I'm concerned about the SQL that gets produced.

The first issue I see is that each predicate-object map generates a unioned SELECT. Wouldn't it be possible to generate a single SQL SELECT and transform the columns into results accordingly?
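To illustrate what I mean, here is a rough sketch with a hypothetical Person table (columns first_name and last_name) mapped by two predicate-object maps; this is not the actual generated SQL:

-- One SELECT per predicate-object map, unioned together (roughly what I observe):
SELECT id, 'firstName' AS predicate, first_name AS value FROM Person
UNION ALL
SELECT id, 'lastName' AS predicate, last_name AS value FROM Person;

-- What I was hoping for: a single scan of the table, with the
-- column-to-predicate mapping applied while building the result triples:
SELECT id, first_name, last_name FROM Person;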

Then the number of SELECTs seems to explode with each triple pattern in the SPARQL query.

And lastly, when I added a second triple pattern joined through a ref-object map, the SQL simply explodes. In my query I select the main triples for a subject plus the next level, to include a blank node produced by the second triples map. Normally I would use OPTIONAL, though it's not implemented yet.

CONSTRUCT
{
  @subject ?p ?o .
  ?blank ?p2 ?o1 .
}
WHERE 
{ 
  @subject ?p ?o .
  @subject ?p1 ?blank . 
  ?blank ?p2 ?o1 .
}

Such a query, executed against an R2RML mapping with two triples maps (with 9 and 1 predicate-object maps respectively), produces an SQL query that is already 4,700 lines long. Add an rdf:type pattern and it's another 2,500 lines on top. With a more complex triple pattern I got a whopping 14k lines.

You actually hit the worst case for my solution :)

I will try to explain why it works this way. Currently it is not possible to generate a single SELECT, because I cannot be sure that the result would be correct. I need information from the database confirming that the join uses the same table and that at least one of the joining columns is unique (this will be done as part of issue #6). Otherwise, a complete join has to be made.
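As a sketch of what I mean (with a hypothetical Person table, not your actual schema): a join between two mappings over the same table can only be collapsed into a single scan when the joining column is known to be unique.

-- Always correct, regardless of key constraints:
SELECT a.id, a.first_name, b.last_name
FROM Person a
JOIN Person b ON a.id = b.id;

-- Collapsed form; equivalent only when Person.id is unique,
-- otherwise the self-join above can return more rows:
SELECT id, first_name, last_name
FROM Person;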

The biggest problem with your query is that there is no information about what to join. I have not seen the R2RML mapping, but I expect that the algorithm has no way to decide which subject-predicate-object mapping can be the source. So the optimizing mechanism has no chance to optimize the query and it generates the full query. Maybe in the future I can try to detect these scenarios and then avoid expanding the unions over the joins.
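Roughly, this is what happens when the unions are expanded over a join (a hypothetical sketch with Person and Contact tables, not your actual generated SQL): two union branches on each side of the join already become 2 x 2 = 4 joined SELECTs.

-- Each triple pattern becomes a union of SELECTs, one per compatible
-- predicate-object map; the join is then expanded branch by branch:
SELECT p.id, p.first_name AS left_value, c.email AS right_value
FROM Person p JOIN Contact c ON c.person_id = p.id
UNION ALL
SELECT p.id, p.first_name, c.phone
FROM Person p JOIN Contact c ON c.person_id = p.id
UNION ALL
SELECT p.id, p.last_name, c.email
FROM Person p JOIN Contact c ON c.person_id = p.id
UNION ALL
SELECT p.id, p.last_name, c.phone
FROM Person p JOIN Contact c ON c.person_id = p.id;

With 9 predicate-object maps on one side, as in your mapping, every additional joined triple pattern multiplies the branch count again, which matches the growth you are seeing.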

Can you please provide the database schema and the R2RML file?

Here are the scripts to initialize the database. And here are the mappings. I have added some more maps since we last wrote.

In the current version, I do not support custom SQL expressions as logical tables, because I am not able to perform any optimizations on them. I will support them in some later version; maybe I will need to extend the mapping file a little bit. However, if you use just tables directly, you will see that the performance improves. Next, I will work on supporting other SPARQL queries than just basic triple patterns.
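To illustrate the difference, here is a hypothetical mapping fragment (not taken from your files): the first triples map uses a custom SQL query as its logical table, which I currently cannot optimize; the second references the base table directly, which I can.

@prefix rr: <http://www.w3.org/ns/r2rml#> .
@prefix ex: <http://example.org/> .

# Logical table defined by a custom SQL query (rr:sqlQuery) -- not optimized yet
<#PersonFromQuery>
  rr:logicalTable [ rr:sqlQuery "SELECT id, name FROM Person WHERE deleted = 0" ] ;
  rr:subjectMap [ rr:template "http://example.org/person/{id}" ] ;
  rr:predicateObjectMap [
    rr:predicate ex:name ;
    rr:objectMap [ rr:column "name" ]
  ] .

# Logical table referencing the base table directly (rr:tableName) -- can be optimized
<#PersonFromTable>
  rr:logicalTable [ rr:tableName "Person" ] ;
  rr:subjectMap [ rr:template "http://example.org/person/{id}" ] ;
  rr:predicateObjectMap [
    rr:predicate ex:name ;
    rr:objectMap [ rr:column "name" ]
  ] .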