Use case: use CAM-KP-API to enhance edges
gaurav opened this issue · 9 comments
Given an edge, can CAM-KP API provide additional information on that edge, including:
- Which Noctua/Reactome pathways include that edge
- Where in the body/cell does this pathway take place
- ...
Example: chemical-gene or gene-gene edge
We should probably try to get this to work before #537.
The hard part is to find some gene pairs that aren't working but should work, so perhaps what we need is a test file that's a list of genes and then we query them to see if we get the expected relationship.
Might be useful to add some exploration endpoints that are easier to work with (e.g. an endpoint that returns a list of models for a particular gene).
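A minimal sketch of what such a test file and check could look like. The tab-separated layout (subject, object, expected predicate) and the helper names are assumptions for illustration, not an existing CAM-KP-API format; the example row reuses the UniProtKB:P51589 / UniProtKB:P08684 pair reported later in this thread. Sending the query to a TRAPI endpoint is left out so the sketch stays self-contained.

```python
import csv
import io

# Hypothetical test-file layout: subject <TAB> object <TAB> expected predicate.
EXAMPLE_TSV = (
    "# subject\tobject\texpected_predicate\n"
    "UniProtKB:P51589\tUniProtKB:P08684\tbiolink:affects_activity_of\n"
)


def load_expectations(tsv_text):
    """Parse the test file into (subject, object, expected_predicate) tuples."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    return [tuple(row) for row in reader if row and not row[0].startswith("#")]


def one_hop_query(subject_id, object_id, predicate="biolink:related_to"):
    """Build a one-hop TRAPI query graph between two identifiers."""
    return {"message": {"query_graph": {
        "nodes": {"n0": {"ids": [subject_id]}, "n1": {"ids": [object_id]}},
        "edges": {"e0": {"predicates": [predicate],
                         "subject": "n0", "object": "n1"}},
    }}}


def has_expected_edge(response, expected_predicate):
    """Return True if any knowledge-graph edge carries the expected predicate."""
    kg = response.get("message", {}).get("knowledge_graph") or {}
    edges = kg.get("edges") or {}
    return any(e.get("predicate") == expected_predicate for e in edges.values())
```

Each row would be turned into a query with `one_hop_query`, posted to the instance under test, and the response passed to `has_expected_edge`.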
Question: can we say gene A and gene B are related if they are in the same model? Should we implement that?
- Since we don't have that, we need to find specific relations for this task.
"Causes influences" could be the relation between two genes that tells if they are related to each other within a model. This is a broad match of biolink:causes, but we only use exact matches, so that might not be accessible from CAM-KP-API. However, there is a set of manual mappings in https://github.com/ExposuresProvider/cam-pipeline/blob/cc13ef6ac7f4d48e91f77a789c71dec344512e1b/biolink-local.ttl that we might be able to access.
When testing TRAPI queries, we will need to make sure the RO relation we're inferring maps to a reasonable Biolink relation. Something confusing is that folks may search for causes but some relevant relations map to affects.
Here are two different ARAX queries that you can pull gene-chemical edges from, as described on slide 8 in this deck:
https://arax.ncats.io/?r=44679
https://arax.ncats.io/?r=52713
Sorry it's taken me so long to respond to this! These queries were super helpful in helping us find and fix some bugs in CAM-KP, and I think there might be more bugs lurking there. Here are my results.
As far as I can tell, out of all the edges @karafecho provides to us, only the edge between UniProtKB:P51589 and UniProtKB:P08684 returns results with a one-hop query. This is the following query:
{"message":{"query_graph":{"nodes":{"n0":{"ids":["UniProtKB:P51589"]},"n1":{"ids":["UniProtKB:P08684"]}},"edges":{"e0":{"predicates":["biolink:related_to"],"subject":"n0","object":"n1"}}}}}
Running this on our development instance returns 960 results, all of them being biolink:affects_activity_of edges from the model http://model.geneontology.org/R-HSA-5423646. I'm not sure why there are so many results, but I'm going to dig into this further to see what's going on here.
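One way to start digging into a large result count is to compare the number of results with the number of distinct triples in the returned knowledge graph. The sketch below assumes the standard TRAPI response layout (`message.results` as a list, `message.knowledge_graph.edges` as a mapping from edge ID to edge); the function name is hypothetical and it does not reproduce the 960-result case.

```python
from collections import Counter


def summarize_response(response):
    """Count results, knowledge-graph edges, distinct triples, and predicates."""
    message = response.get("message", {})
    results = message.get("results") or []
    edges = (message.get("knowledge_graph") or {}).get("edges") or {}

    # Tally how often each predicate appears among the knowledge-graph edges.
    predicate_counts = Counter(e.get("predicate") for e in edges.values())

    # Collapse edges that share subject, predicate, and object.
    distinct_triples = {
        (e.get("subject"), e.get("predicate"), e.get("object"))
        for e in edges.values()
    }
    return {
        "results": len(results),
        "kg_edges": len(edges),
        "distinct_triples": len(distinct_triples),
        "predicates": dict(predicate_counts),
    }
```

If `results` is much larger than `distinct_triples`, the extra results are probably repeats of the same statement rather than independent findings.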
Two-hop queries do a bit better, with:
- 360 results for CHEBI:34477-(?)-UniProtKB:P08684
- 144 results for CHEBI:63840-(?)-UniProtKB:P08684
  - This has some interesting results, e.g. CHEBI:63840 ("5'-hydroxyomeprazole") biolink:participates_in GO:0006739 ("NADP metabolic process") biolink:caused_by NCBIGene:100861540
- 1000+ results for (CHEBI:17996 or CHEBI:23114)-(?)-UniProtKB:P13569
- 1000+ results for UniProtKB:O75795-(?)-UniProtKB:P08684
- 1000+ results for UniProtKB:P16662-(?)-UniProtKB:P08684
- 1000+ results for UniProtKB:P19224-(?)-UniProtKB:P08684
- 1000+ results for UniProtKB:P22310-(?)-UniProtKB:P08684
- 1000+ results for UniProtKB:P54855-(?)-UniProtKB:P08684
- 1000+ results for UniProtKB:P24462-(?)-UniProtKB:P08684
- 1000+ results for UniProtKB:Q9HB55-(?)-UniProtKB:P08684
- 1000+ results for CHEBI:35703-(?)-UniProtKB:P08684
I used the query:
{"message":{"query_graph":{"nodes":{"n0":{"ids":["CHEBI:17996","CHEBI:23114"]},"n1":{},"n2":{"ids":["UniProtKB:P13569"]}},"edges":{"e0":{"predicates":["biolink:related_to"],"subject":"n0","object":"n1"},"e1":{"predicates":["biolink:related_to"],"subject":"n1","object":"n2"}}}}}
As you can see, UniProtKB:P08684 seems to be quite overrepresented in the results, and again it seems to me that we're seeing a lot more results than I would expect to see here.
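For anyone who wants to rerun the other pairs in the list, the two-hop query quoted above can be generated rather than edited by hand. This is a small sketch; the function name is hypothetical, and only the query structure comes from the query shown in this comment.

```python
import json


def two_hop_query(start_ids, end_ids, predicate="biolink:related_to"):
    """Build a two-hop TRAPI query with an unconstrained middle node (n1)."""
    return {"message": {"query_graph": {
        "nodes": {
            "n0": {"ids": list(start_ids)},
            "n1": {},  # any intermediate entity
            "n2": {"ids": list(end_ids)},
        },
        "edges": {
            "e0": {"predicates": [predicate], "subject": "n0", "object": "n1"},
            "e1": {"predicates": [predicate], "subject": "n1", "object": "n2"},
        },
    }}}


query = two_hop_query(["CHEBI:17996", "CHEBI:23114"], ["UniProtKB:P13569"])
# Compact serialization, matching the form of the query quoted above.
compact = json.dumps(query, separators=(",", ":"))
```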
I wonder if maybe we shouldn't need to do multihop queries to get these results -- whether we should have some related_to triples connecting entities that have any relation with each other.
So, I think, next steps:
- Dig into the one-hop results and figure out what's going on there.
- Dig into the first two two-hop result sets, figure out if there's anything interesting in there, and if we should change our triplestore so that you can get these results with a one-hop query.
Thanks for your work on this, Gaurav.
The two-hop results indeed do look interesting, although I have not completed a deep dive.
Any updates, Gaurav? Happy to help if you point me in the right direction.
Hi Kara! My work on this issue currently revolves around the new /lookup endpoint (#572). My goal is to have an endpoint that (1) normalizes input identifiers and (2) bypasses the main SPARQL query we currently use and queries the triplestore directly, returning everything we know about a particular identifier, so that we can check whether the main SPARQL query is working correctly. This is primarily intended for debugging right now, but once that's done, I want to provide the ability to filter by an object as well, so we should have an API endpoint that would allow you to query e.g. /lookup?subject=CHEBI:17685&object=GO:0019136&hopLimit=10 to find every relation between CHEBI:17685 and GO:0019136 across up to ten hops after normalizing both of those identifiers. I think that'll give us everything we need to enhance edges and double-check our SPARQL queries at the same time. I've gotten sidetracked by some database issues, but I'm hoping to have the basic /lookup endpoint up by early next week, with support for filtering by an object identifier added soon thereafter. Happy to discuss any of this in a meeting if that would be useful!
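As a rough illustration of how a client might build requests for the proposed endpoint: the parameter names `subject`, `object`, and `hopLimit` come from the example URL in this comment, while the function name and the `base_url` argument are assumptions, and the endpoint itself is still planned, so its final shape may differ.

```python
from urllib.parse import urlencode


def lookup_url(subject, obj=None, hop_limit=None, base_url=""):
    """Build a URL for the proposed /lookup endpoint (hypothetical client helper)."""
    params = {"subject": subject}
    if obj is not None:
        params["object"] = obj
    if hop_limit is not None:
        params["hopLimit"] = hop_limit
    # Keep ':' unescaped so CURIEs stay readable in the URL.
    return f"{base_url}/lookup?{urlencode(params, safe=':')}"


url = lookup_url("CHEBI:17685", obj="GO:0019136", hop_limit=10)
# url == "/lookup?subject=CHEBI:17685&object=GO:0019136&hopLimit=10"
```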
This all sounds great, Gaurav! I very much appreciate the effort.