Knowledge-Graph-Hub/kg-covid-19

Add a query for druggable 2nd order interactors that filters interactions by STRING combined_score

Opened this issue · 10 comments

@lpalbou made a ticket here so we can discuss this SPARQL query

@wdduncan here's the ticket if you want to help here

Basically we just want to change this SPARQL query:

  PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
  prefix owl: <http://www.w3.org/2002/07/owl#>
  prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
  prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
  prefix bl: <https://w3id.org/biolink/vocab/>
  SELECT ?covp ?covplab ?humanp ?humanplab ?humanp2 ?humanp2lab ?drug ?druglab
  WHERE {
    VALUES ?covtaxon {"2697049"^^xsd:string }
    VALUES ?humantaxon { "9606"^^xsd:string }
  ?humanp bl:interacts_with ?covp .
  ?humanp bl:interacts_with ?humanp2 .
  ?covp bl:category bl:Protein; <https://www.example.org/UNKNOWN/ncbi_taxid> ?covtaxon .
  ?humanp bl:category bl:Protein; <https://www.example.org/UNKNOWN/ncbi_taxid> ?humantaxon .
  ?humanp2 bl:category bl:Protein; <https://www.example.org/UNKNOWN/ncbi_taxid> ?humantaxon .
  ?humanp2 <https://www.example.org/UNKNOWN/TDL> "Tclin"^^<http://www.w3.org/2001/XMLSchema#string> .
  ?drug bl:category bl:Drug; bl:interacts_with ?humanp2 .
  OPTIONAL { ?covp rdfs:label ?covplab } .
  OPTIONAL { ?humanp rdfs:label ?humanplab } .
  OPTIONAL { ?humanp2 rdfs:label ?humanp2lab } .
  OPTIONAL { ?drug rdfs:label ?druglab } .
  }

to filter out interactions with combined_scores < 700.

This ticket describes where the combined_scores live now

@justaddcoffee Is there a SPARQL endpoint I can test the query on?

Which relation holds the "scores"? Do you have a data property that connects to the literal?

@justaddcoffee
I re-organized the query a bit:

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX bl: <https://w3id.org/biolink/vocab/>
PREFIX combined_score: <https://www.example.org/UNKNOWN/combined_score>
PREFIX ncbi_taxid: <https://www.example.org/UNKNOWN/ncbi_taxid>
PREFIX tdl:  <https://www.example.org/UNKNOWN/TDL>

SELECT 
	?combined_score 
	?covp 
	?covp_label 
	?humanp1 
	?humanp1_label 
	?humanp2 
	?humanp2_label
	?drug
	?drug_label
WHERE {
  # set taxon criteria
  # VALUES ?covtaxon {"2697049"^^xsd:string } 
  # ?covp ncbi_taxid: ?covtaxon .  
             
  VALUES ?humantaxon { "9606"^^xsd:string }
  ?humanp1 ncbi_taxid: ?humantaxon .
  ?humanp2 tdl: ?tdl;
           ncbi_taxid: ?humantaxon .
         
  ?score combined_score: ?combined_score . # find score edge and its value
  ?score bl:subject ?covp . # subject of the score edge
  ?covp bl:interacts_with ?humanp1 . # what covp interacts with
  ?humanp1 bl:interacts_with ?humanp1 . # what humanp1 interacts with
  
  # add drug interaction info
  ?drug bl:category bl:Drug; 
        bl:interacts_with ?humanp2 .
  
  filter (xsd:integer(?combined_score) < 700) # filter score values to less than 700
  filter (?tdl = "Tclin"^^<http://www.w3.org/2001/XMLSchema#string>)
  
  # find optional labels
  optional {?covp rdfs:label ?covp_label}
  optional {?humanp1 rdfs:label ?humanp1_label}
  optional {?humanp2 rdfs:label ?humanp2_label}
  optional {?drug rdfs:label ?drug_label}
 
 # filter (bound(?covp_label)) # just being curious
} LIMIT 100

I don't get any results when I uncomment this part:

 # VALUES ?covtaxon {"2697049"^^xsd:string }  
  # ?s1 ncbi_taxid: ?covtaxon .                                

So I am probably misunderstanding something :(

Out of curiosity, I added this criterion (commented out above):

  filter (bound(?covp_label))

This returned results with ?covp_label as clp1l_human.

Also, while I was experimenting I hit some timeout errors. You may need to tweak the server settings.

Hope this helps!

Thanks @wdduncan!

I don't get any results when I uncomment this part:
  # VALUES ?covtaxon {"2697049"^^xsd:string }  
  # ?s1 ncbi_taxid: ?covtaxon .   

I think what's going on here is that the ?covp -> ?humanp1 edges are coming from IntAct, not STRING. IntAct doesn't have combined_scores, since that's a STRING thing.

Is there a way to say here: look for a ?combined_score, but if it's not there, ignore the filter() clause below? Or maybe some other SPARQL magic?

  ?score combined_score: ?combined_score . # find score edge and its value
  ?score bl:subject ?covp . # subject of the score edge

  ...

  filter (xsd:integer(?combined_score) < 700) # filter score values to less than 700

@justaddcoffee I had a TYPO in my query :(

?humanp1 bl:interacts_with ?humanp1 .

Should be

?humanp1 bl:interacts_with ?humanp2 .

Sorry ....

I think this query does what you want:

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX bl: <https://w3id.org/biolink/vocab/>
PREFIX combined_score: <https://www.example.org/UNKNOWN/combined_score>
PREFIX ncbi_taxid: <https://www.example.org/UNKNOWN/ncbi_taxid>
PREFIX tdl:  <https://www.example.org/UNKNOWN/TDL>

SELECT DISTINCT
	?combined_score 
	?covp 
	?covp_label 
	?humanp1 
	?humanp1_label 
	?humanp2 
	?humanp2_label
	?drug
	?drug_label
WHERE {
	VALUES ?humantaxon { "9606"^^xsd:string }
	?humanp1 ncbi_taxid: ?humantaxon .
	?humanp2 tdl: ?tdl;
             ncbi_taxid: ?humantaxon .
  
  	# specify the score association
	?score rdf:type bl:Association;
                      bl:subject ?covp .
  
  	# create a default value of -1 for non-existent combined scores
  	OPTIONAL {?score combined_score: ?val}
  	BIND( IF(!BOUND(?val), -1, xsd:integer(?val)) as ?combined_score )

  	# protein interactions
  	?covp bl:interacts_with ?humanp1 . # what covp interacts with
  	?humanp1 bl:interacts_with ?humanp2 . # what humanp1 interacts with

  	# add drug interaction info
 	 ?drug bl:category bl:Drug; 
           bl:interacts_with ?humanp2 .
  
  	FILTER (?tdl = "Tclin"^^<http://www.w3.org/2001/XMLSchema#string>)
  	FILTER (?combined_score < 700)
  
  	OPTIONAL { ?covp rdfs:label ?covp_label } .
  	OPTIONAL { ?humanp1 rdfs:label ?humanp1_label } .
  	OPTIONAL { ?humanp2 rdfs:label ?humanp2_label } .
  	OPTIONAL { ?drug rdfs:label ?drug_label } .
  

} LIMIT 100

It works by creating a default value of -1 when the combined score is not bound:

  OPTIONAL {?score combined_score: ?val}
  BIND( IF(!BOUND(?val), -1, xsd:integer(?val)) as ?combined_score )

However, when I remove the LIMIT 100 statement or try to search for specific scores (e.g., FILTER (?combined_score = -1)), the query times out.

So, you may need to adjust some server settings.

Thanks @wdduncan and @lpalbou!

I made a few edits - see below - I think this is almost right

  • added taxon constraint to ?covp:
  VALUES ?covtaxon {"2697049"^^xsd:string } 
    ?covp ncbi_taxid: ?covtaxon .  
  • changed query to filter > 700 (not <)
  	FILTER (?combined_score > 700)
  • changed default combined_score to 1000 (this maybe is still right though)
  	BIND( IF(!BOUND(?val), 1000, xsd:integer(?val)) as ?combined_score )

This yields 18 distinct ?humanp2s, which is useful, but I'm not sure this combined_score is right.

What we want to do is keep interactions that don't have a combined_score, and filter interactions that have a score that is < 700

Here's the query:

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX bl: <https://w3id.org/biolink/vocab/>
PREFIX combined_score: <https://www.example.org/UNKNOWN/combined_score>
PREFIX ncbi_taxid: <https://www.example.org/UNKNOWN/ncbi_taxid>
PREFIX tdl:  <https://www.example.org/UNKNOWN/TDL>

SELECT DISTINCT
	?combined_score 
	?covp 
	?covp_label 
	?humanp1 
	?humanp1_label 
	?humanp2 
	?humanp2_label
	?drug
	?drug_label
WHERE {
  # set taxon criteria
  VALUES ?covtaxon {"2697049"^^xsd:string } 
    ?covp ncbi_taxid: ?covtaxon .  
  VALUES ?humantaxon { "9606"^^xsd:string }
	?humanp1 ncbi_taxid: ?humantaxon .
	?humanp2 tdl: ?tdl;
             ncbi_taxid: ?humantaxon .
  
  	# specify the score association
	?score rdf:type bl:Association;
                      bl:subject ?covp .
  
  	# create a default value of 1000 for non-existent combined scores
  	OPTIONAL {?score combined_score: ?val}
  	BIND( IF(!BOUND(?val), 1000, xsd:integer(?val)) as ?combined_score )

  	# protein interactions
  	?covp bl:interacts_with ?humanp1 . # what covp interacts with
  	?humanp1 bl:interacts_with ?humanp2 . # what humanp1 interacts with

  	# add drug interaction info
 	 ?drug bl:category bl:Drug; 
           bl:interacts_with ?humanp2 .
  
  	FILTER (?tdl = "Tclin"^^<http://www.w3.org/2001/XMLSchema#string>)
  	FILTER (?combined_score > 700)
  
  	OPTIONAL { ?covp rdfs:label ?covp_label } .
  	OPTIONAL { ?humanp1 rdfs:label ?humanp1_label } .
  	OPTIONAL { ?humanp2 rdfs:label ?humanp2_label } .
  	OPTIONAL { ?drug rdfs:label ?drug_label } .
  

}

Sorry, I thought by "filter" you meant everything with a combined score < 700.

What we want to do is keep interactions that don't have a combined_score, and filter interactions that have a score that is < 700

The statement

OPTIONAL {?score combined_score: ?val}
BIND( IF(!BOUND(?val), 1000, xsd:integer(?val)) as ?combined_score )

is a hacky way of doing this. It just assigns an acceptable value to the unbound combined scores. You can set it to something that makes them easily identifiable, e.g., "999999".
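
A variant that avoids the sentinel value, as a rough sketch only: keep the OPTIONAL, and let the FILTER pass any row where ?val is unbound. It reuses the prefixes and the ?score pattern from the query above, and uses >= 700 so that a score of exactly 700 is kept (the query above uses > 700). I have not run this against the endpoint.

```sparql
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX bl: <https://w3id.org/biolink/vocab/>
PREFIX combined_score: <https://www.example.org/UNKNOWN/combined_score>

SELECT DISTINCT ?covp ?score ?val
WHERE {
  # same score association pattern as in the full query
  ?score rdf:type bl:Association ;
         bl:subject ?covp .

  # the score is optional: edges that do not come from STRING have none
  OPTIONAL { ?score combined_score: ?val }

  # keep rows with no score at all, or with a score of at least 700
  FILTER (!BOUND(?val) || xsd:integer(?val) >= 700)
}
LIMIT 100
```

The OPTIONAL and FILTER lines can be dropped into the full query in place of the BIND and the FILTER on ?combined_score. If the scores are meant to apply to the ?humanp1 -> ?humanp2 edge rather than to ?covp, the bl:subject pattern would need to point at ?humanp1 instead; whether the association nodes also carry a bl:object is something to check against the data.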

It might be useful to create a separate named graph or repository consisting of just the entities of interest. E.g., see the CONSTRUCT clause.

After you get the CONSTRUCT query working right, you can then use the INSERT DATA clause to add the triples.
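
A minimal sketch of what such a CONSTRUCT query could look like, assuming the same prefixes and graph patterns as the query above. The score filter is left out to keep it short, and the choice of which triples to copy into the subgraph is only an illustration.

```sparql
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX bl: <https://w3id.org/biolink/vocab/>
PREFIX ncbi_taxid: <https://www.example.org/UNKNOWN/ncbi_taxid>
PREFIX tdl: <https://www.example.org/UNKNOWN/TDL>

# template: the triples that end up in the extracted subgraph
CONSTRUCT {
  ?covp bl:interacts_with ?humanp1 .
  ?humanp1 bl:interacts_with ?humanp2 .
  ?drug bl:interacts_with ?humanp2 .
}
WHERE {
  VALUES ?covtaxon { "2697049"^^xsd:string }
  VALUES ?humantaxon { "9606"^^xsd:string }

  ?covp ncbi_taxid: ?covtaxon ;
        bl:interacts_with ?humanp1 .
  ?humanp1 ncbi_taxid: ?humantaxon ;
           bl:interacts_with ?humanp2 .
  ?humanp2 ncbi_taxid: ?humantaxon ;
           tdl: "Tclin"^^xsd:string .
  ?drug bl:category bl:Drug ;
        bl:interacts_with ?humanp2 .
}
```

To write the result straight into a named graph instead of returning it, the CONSTRUCT template can be swapped for a SPARQL Update of the form INSERT { GRAPH <some-graph-IRI> { ... } } WHERE { ... }, where the graph IRI is whatever name you pick, provided the endpoint allows updates.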

I second the server timeout settings request :)