
Set-up SPARQL Endpoint


TASK

Task Type: PKT DATA DELIVERY

Select and set up a SPARQL endpoint for exploring KG build data

TODO

  • Pick an endpoint. Here is a Medium article that compares and contrasts existing triplestores. Considering what we care about (Docker compatibility, RDF and querying speed), I have selected a few and ordered them from best to worst:
  • Configure with CI/CD
  • Figure out where to host it
    • Through Google Cloud Run?

Questions:

  • @bill-baumgartner - Which one do you think we should use?
  • Two versions are needed: one for our production build and one for those who want to build their own

Follow-up

  • How many concurrent queries can be run from each service on the free plan? This matters depending on whether we are making it available to the public or just using it for the challenge

@bill-baumgartner - I successfully brought down the endpoint tonight 😄. It's running again; I restarted the container and it came back. The query I ran is shown below; it went down because I did not include LIMIT. I wonder if we should add something to protect against others doing this, or if there is something we can add to help it restart itself in these situations. Just something for us to discuss tomorrow!

PREFIX obo: <http://purl.obolibrary.org/obo/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT ?s ?p ?o 
WHERE { 
  VALUES ?p {
    obo:RO_0000087
    obo:RO_0002434
    rdfs:subClassOf
  }
  ?s ?p ?o 
} 

Note that when adding LIMIT n, the query executes totally fine. This query format is the template RH provided me, which is why I was testing it out.

Good to know. From grepping our input n-triples file, we would expect the following numbers of responses:

  • subClassOf: 1,340,072
  • RO_0000087 (has_role): 40,500
  • RO_0002434 (interacts_with): 4,042,408

So, in total (1,340,072 + 40,500 + 4,042,408 = 5,422,980), this query would have exceeded the 5M triple limit we had originally set. It may be the case that for queries with many results, users will need to request results in batches using ORDER BY + LIMIT + OFFSET.
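For example, here is a minimal paging sketch in bash (the endpoint URL is the one from the wget commands further down; the page size, CSV output format, and file naming are illustrative assumptions, not settings we've agreed on):

#!/usr/bin/env bash
# Paging sketch: pull a large result set in fixed-size batches using
# ORDER BY + LIMIT + OFFSET so no single request exceeds the limit.
ENDPOINT="http://35.233.212.30/blazegraph/sparql"
PAGE=100000  # illustrative page size

for ((offset = 0; ; offset += PAGE)); do
  query="PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?s ?o WHERE { ?s rdfs:subClassOf ?o }
ORDER BY ?s LIMIT ${PAGE} OFFSET ${offset}"
  # -G + --data-urlencode sends the query percent-encoded as a GET request
  curl -sG "${ENDPOINT}" --data-urlencode "query=${query}" \
    -H "Accept: text/csv" > "batch_${offset}.csv"
  # a header-only file (one line) means we've paged past the last result
  [ "$(wc -l < "batch_${offset}.csv")" -le 1 ] && break
done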

I agree. I did some experimenting using the SPARQL Proxy settings (i.e., ENABLE_QUERY_SPLITTING and MAX_CHUNK_LIMIT) in docker-compose.yml and have some interesting insight to share in our meeting this afternoon. In a nutshell, I can get it to return all of the results, but it then generates a different error when trying to return them (which, with ENABLE_QUERY_SPLITTING enabled, come back as JSON). See below:

buffer.js:799
api_1      |     return this.utf8Slice(start, end);
api_1      |                 ^
api_1      | 
api_1      | Error: Cannot create a string longer than 0x1fffffe8 characters
api_1      |     at Buffer.toString (buffer.js:799:17)
api_1      |     at Request.<anonymous> (/app/node_modules/request/request.js:1128:39)
api_1      |     at Request.emit (events.js:315:20)
api_1      |     at IncomingMessage.<anonymous> (/app/node_modules/request/request.js:1076:12)
api_1      |     at Object.onceWrapper (events.js:421:28)
api_1      |     at IncomingMessage.emit (events.js:327:22)
api_1      |     at endReadableNT (internal/streams/readable.js:1327:12)
api_1      |     at processTicksAndRejections (internal/process/task_queues.js:80:21) {
api_1      |   code: 'ERR_STRING_TOO_LONG'
api_1      | }
api_1      | npm ERR! code ELIFECYCLE
api_1      | npm ERR! errno 1
api_1      | npm ERR! sparql-proxy@0.0.0 start: `node --experimental-modules src/server.mjs`
api_1      | npm ERR! Exit status 1
api_1      | npm ERR! 
api_1      | npm ERR! Failed at the sparql-proxy@0.0.0 start script.
api_1      | npm ERR! This is probably not a problem with npm. There is likely additional logging output above.
api_1      | 
api_1      | npm ERR! A complete log of this run can be found in:
api_1      | npm ERR!     /app/.npm/_logs/2021-01-08T18_52_49_474Z-debug.log
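For the record, assuming those two settings are passed as environment variables on the proxy service (which is how sparql-proxy is typically configured in docker-compose.yml), a quick way to double-check what values the running stack actually resolved is:

# print the resolved compose config and pull out the two proxy settings
docker-compose config | grep -E 'ENABLE_QUERY_SPLITTING|MAX_CHUNK_LIMIT'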

@bill-baumgartner - So that we have a record, here are the two queries we ran against the endpoint via the command line:

wget Simple:

wget -qO- "http://35.233.212.30/blazegraph/sparql?query=select * where { ?s ?p ?o } " > filename.xml

wget Relation Template:

wget -qO- "http://35.233.212.30/blazegraph/sparql?query=PREFIX%20obo%3A%20%3Chttp%3A%2F%2Fpurl.obolibrary.org%2Fobo%2F%3E%20PREFIX%20rdfs%3A%20%3Chttp%3A%2F%2Fwww.w3.org%2F2000%2F01%2Frdf-schema%23%3E%20SELECT%20%3Fs%20%3Fp%20%3Fo%20WHERE%20%7B%20VALUES%20%3Fp%20%7B%20obo%3ARO_0000087%20obo%3ARO_0002434%20rdfs%3AsubClassOf%20%7D%20%3Fs%20%3Fp%20%3Fo%20%7D" > filename.xml
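As an aside, the percent-encoding does not have to be done by hand; curl can build the same request (a sketch equivalent to the relation-template query above, with -G turning the encoded body into a GET query string):

# same relation-template query, with curl doing the percent-encoding
curl -sG "http://35.233.212.30/blazegraph/sparql" \
  --data-urlencode 'query=PREFIX obo: <http://purl.obolibrary.org/obo/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?s ?p ?o
WHERE {
  VALUES ?p { obo:RO_0000087 obo:RO_0002434 rdfs:subClassOf }
  ?s ?p ?o
}' > filename.xml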

UPDATE

Need to do the following things to fully address this issue:

  • Remove unneeded SPARQL proxy functionality
  • Incorporate this workflow into a GitHub Action
  • Write a script that monitors the endpoint and alerts us if it goes down or is interrupted (a rough sketch is included after the restart command below)

When it goes down, run the following from the location shown below within the GCP instance:

~/PheKnowLator/builds/deploy/triple-store$ docker-compose up -d
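For the monitoring item above, something along these lines could be cron'd on the GCP instance (a rough sketch only: the ASK health check, timeout, and alert hook are placeholders for us to discuss, not a finished design):

#!/usr/bin/env bash
# Watchdog sketch: ping the endpoint with a trivial ASK query; if it does
# not answer, restart the stack from the deploy directory used above.
ENDPOINT="http://35.233.212.30/blazegraph/sparql"
COMPOSE_DIR="$HOME/PheKnowLator/builds/deploy/triple-store"

if ! curl -sfG --max-time 30 "$ENDPOINT" \
    --data-urlencode 'query=ASK { ?s ?p ?o }' > /dev/null; then
  echo "$(date -u) endpoint unreachable; restarting" >&2
  (cd "$COMPOSE_DIR" && docker-compose up -d)
  # placeholder alert -- swap in email/Slack/webhook of choice
  # mail -s "PheKnowLator SPARQL endpoint restarted" us@example.org < /dev/null
fi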

@bill-baumgartner - I am going to close this for now. I think the 99% automated approach we are using now is totally fine, as the endpoint is not something we plan to keep forever. Let me know if you disagree.