LOD-Laundromat/Frank

Optimize namespace filtering

Closed this issue · 1 comments

The namespace index and the SPARQL query for Frank documents currently work independently. How it works at the moment:

  • If only a namespace flag is provided, the documents for this ns are fetched from the index and returned directly.
  • If no flags are provided, or flags that use C-LOD are provided, the SPARQL endpoint is used.
  • If both a ns flag and e.g. minTriples are provided, both the index and the endpoint are used. First the namespace docs are fetched, then the SPARQL query is executed. For each SPARQL result, we check whether it occurs in the ns doc list. If it does, we print it. Otherwise, we skip it.

The last case is problematic. Suppose we get just 1 document for a certain namespace: we still loop through all the SPARQL results (possibly around 650,000), which makes it very slow.
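To make the cost concrete, here is a minimal Python sketch of the post-filtering step described above. It is not Frank's actual code: the function name `filter_after_query` and the example.org URIs are made up for illustration. The point is that the loop runs once per SPARQL result, however small the namespace list is.

```python
from typing import Iterable, Iterator, Set


def filter_after_query(sparql_docs: Iterable[str], ns_docs: Set[str]) -> Iterator[str]:
    """Yield only the SPARQL results that also occur in the namespace doc list."""
    for doc in sparql_docs:   # visits every SPARQL result, even if ns_docs has 1 entry
        if doc in ns_docs:    # set membership is cheap; the full scan is the cost
            yield doc


# Illustrative data only (hypothetical URIs, not real LOD Laundromat documents).
ns_docs = {"http://example.org/doc/2"}
sparql_docs = [f"http://example.org/doc/{i}" for i in range(5)]
matches = list(filter_after_query(sparql_docs, ns_docs))
```

With one namespace document and roughly 650,000 SPARQL results, all results are still fetched and scanned to print a single line.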

The solution would be to combine the index and the SPARQL query, i.e. filter the documents inside the SPARQL query using the documents from the ns index directly (with a VALUES clause).
However, we should avoid sending a SPARQL request of around 20 MB: suppose we get 600,000 docs from the ns index. Adding all of these to the SPARQL query would increase its size drastically.
So, the solution:

  • If the number of documents returned by the namespace index is below a certain threshold, include them in the SPARQL query.
  • Otherwise, filter the SPARQL results after execution.

Done. Maximum number of document references to include in the SPARQL query: 50,000.
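For reference, a rough Python sketch of the threshold logic described in this issue. This is an illustration under assumptions, not the code that was committed: `build_query`, `values_clause`, the query template and the `http://example.org/triples` predicate are all hypothetical, and only the limit of 50,000 comes from the comment above.

```python
from typing import List, Tuple

MAX_DOCS_IN_QUERY = 50_000  # limit mentioned in this issue

# Hypothetical query shape; the real Frank query and predicates will differ.
QUERY_TEMPLATE = """SELECT ?doc WHERE {{
{values}  ?doc <http://example.org/triples> ?triples .
  FILTER(?triples >= {min_triples})
}}"""


def values_clause(docs: List[str]) -> str:
    """Render a SPARQL VALUES clause restricting ?doc to the given IRIs.

    Assumes the IRIs contain no characters that need escaping inside <...>.
    """
    iris = " ".join(f"<{d}>" for d in docs)
    return f"  VALUES ?doc {{ {iris} }}\n"


def build_query(ns_docs: List[str], min_triples: int,
                max_docs: int = MAX_DOCS_IN_QUERY) -> Tuple[str, bool]:
    """Return (query, needs_post_filter).

    Small namespace lists are inlined into the query; large ones are left
    out, and the caller filters the SPARQL results afterwards instead.
    """
    inline = len(ns_docs) <= max_docs
    values = values_clause(ns_docs) if inline else ""
    query = QUERY_TEMPLATE.format(values=values, min_triples=min_triples)
    return query, not inline


# Small list: inlined, no post-filter needed.
small_query, small_post = build_query(["http://example.org/a"], 100)
# Large list (threshold lowered to 1 for the demo): not inlined, post-filter needed.
big_query, big_post = build_query(
    ["http://example.org/a", "http://example.org/b"], 100, max_docs=1
)
```

The returned flag tells the caller whether it still has to check each result against the ns doc list, so the slow path is only taken for namespaces with more documents than the limit.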