wikipathways/rWikiPathways

How to extract edges from a GPML file?

Krithika-Bhuvan opened this issue · 2 comments

I’m trying to figure out how to extract the edges of a network from a GPML file, using WP1589.gpml as an example. I think I need to extract the information from the <Interaction> tags. These tags have <Point> child tags, which I'm guessing will give me information about the edges.

Question 1 - See screenshot. Items [[9]] and [[10]] in the list each have two <Point> tags, which could represent an edge from the “From node” to the “To node”. But item [[11]] has three <Point> tags and one <Anchor> tag. Which would be the “From node” and which the “To node”? Can you clarify this?
Link to screenshot here: https://drive.google.com/file/d/1nl3GnqLSbnQR5asWIkPqgcJKbjCh7tsH/view?usp=sharing

Question 2 - Attached is a CSV file of the nodes I extracted with the help of Martin Morgan’s code in Slack. I found some GraphIds (for example “bf654”) that appear in the <Interaction> section but are not defined earlier in the <DataNode> section, so I don’t know whether they are genes, metabolites, or something else. How do I get around this?
Link to csv file here: https://drive.google.com/file/d/1maGS4Im6jzkO2Tz1_4W9LkSTzpkp_N9T/view?usp=sharing

Hi. Before getting to your questions, let me also point you to our documentation for the RDF versions of our content. This format is already structured for interaction-based use cases, as each "triple" is an interaction. Basically, we've already done the hard work of extracting interactions from the XML (GPML) and putting them into a more interaction-friendly format.
https://www.wikipathways.org/index.php/Help:WikiPathways_Sparql_queries#Get_all_interactions_for_a_particular_pathway.

Q1 - To extract interactions from GPML, focus on the <Point> tags that contain a GraphRef attribute. These refer to the GraphId of a <DataNode> or of another defined entity in the GPML. The presence and value of the ArrowHead attribute on the <Point> tags indicate the "from" and "to" ends. Some interactions are undirected, so no inherent "from" and "to" is encoded.
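
A minimal sketch of the Q1 approach in Python, using only the standard library. The toy GPML string, the function names, and the dictionary keys are made up for illustration. It also assumes that a missing ArrowHead (or the value "Line") means "no arrowhead at this end"; check that against the GPML version of your file. Points without a GraphRef, such as the middle one of three, are treated as waypoints and ignored.

```python
import xml.etree.ElementTree as ET

# Toy GPML for illustration only (not taken from WP1589).
SAMPLE_GPML = """<Pathway xmlns="http://pathvisio.org/GPML/2013a" Name="toy">
  <DataNode TextLabel="A" GraphId="n1" Type="GeneProduct"/>
  <DataNode TextLabel="B" GraphId="n2" Type="Metabolite"/>
  <Interaction GraphId="e1">
    <Graphics>
      <Point X="0" Y="0" GraphRef="n1"/>
      <Point X="5" Y="0"/>
      <Point X="9" Y="9" GraphRef="n2" ArrowHead="Arrow"/>
      <Anchor Position="0.5" GraphId="a1"/>
    </Graphics>
  </Interaction>
</Pathway>"""


def local_name(tag):
    """Drop the '{namespace}' prefix that ElementTree adds to tag names."""
    return tag.rsplit("}", 1)[-1]


def extract_edges(gpml_text):
    """Return one dict per Interaction whose two ends both carry a GraphRef."""
    root = ET.fromstring(gpml_text)
    edges = []
    for el in root.iter():
        if local_name(el.tag) != "Interaction":
            continue
        # Keep only the Points that are attached to something.
        refs = [p for p in el.iter()
                if local_name(p.tag) == "Point" and p.get("GraphRef")]
        if len(refs) < 2:
            continue  # at least one end is not attached to anything
        first, last = refs[0], refs[-1]
        # Assumption: absent ArrowHead or "Line" means no arrowhead.
        head_first = first.get("ArrowHead") or "Line"
        head_last = last.get("ArrowHead") or "Line"
        if head_last != "Line":
            # Also covers the case where both ends carry an arrowhead.
            source, target, arrow = first, last, head_last
        elif head_first != "Line":
            source, target, arrow = last, first, head_first
        else:
            source, target, arrow = first, last, None  # undirected
        edges.append({
            "interaction": el.get("GraphId"),
            "source": source.get("GraphRef"),
            "target": target.get("GraphRef"),
            "arrow": arrow,
            "directed": arrow is not None,
        })
    return edges


edges = extract_edges(SAMPLE_GPML)
```

For a real file you would read its contents into a string and pass that to `extract_edges`. For undirected interactions, "source" and "target" are just the two ends in document order.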

Q2 - Interactions can connect lots of different things, including Groups, Shapes, Labels, and Anchors on other interactions. The subset of GraphRefs that point to a <DataNode> GraphId are the ones that directly connect molecules. But the ones that terminate on a <Group> or on an anchor of another interaction are obviously also important for understanding how molecules are connected.
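
One way to find out what a GraphRef points to is to index every element that carries a GraphId, not just the DataNodes. Here is a rough Python sketch using only the standard library. The toy GPML, the function names, and the dictionary keys are made up for illustration, and it assumes that every referable element, including an <Anchor> nested inside an interaction, exposes its identifier as a GraphId attribute.

```python
import xml.etree.ElementTree as ET

# Toy GPML for illustration only (not taken from WP1589).
SAMPLE_GPML = """<Pathway xmlns="http://pathvisio.org/GPML/2013a" Name="toy">
  <DataNode TextLabel="A" GraphId="n1" Type="GeneProduct"/>
  <DataNode TextLabel="B" GraphId="n2" Type="Metabolite"/>
  <DataNode TextLabel="C" GraphId="n3" Type="Protein"/>
  <Interaction GraphId="e1">
    <Graphics>
      <Point X="0" Y="0" GraphRef="n1"/>
      <Point X="9" Y="0" GraphRef="n2" ArrowHead="Arrow"/>
      <Anchor Position="0.5" GraphId="a1"/>
    </Graphics>
  </Interaction>
  <Interaction GraphId="e2">
    <Graphics>
      <Point X="4" Y="9" GraphRef="n3"/>
      <Point X="4" Y="0" GraphRef="a1" ArrowHead="mim-catalysis"/>
    </Graphics>
  </Interaction>
  <Label TextLabel="note" GraphId="l1"/>
</Pathway>"""


def local_name(tag):
    """Drop the '{namespace}' prefix that ElementTree adds to tag names."""
    return tag.rsplit("}", 1)[-1]


def index_graph_ids(gpml_text):
    """Map every GraphId in the file to the kind of element that defines it."""
    root = ET.fromstring(gpml_text)
    index = {}

    def walk(el, owner):
        kind = local_name(el.tag)
        if kind == "Interaction":
            owner = el.get("GraphId")  # remembered for nested Anchors
        graph_id = el.get("GraphId")
        if graph_id:
            entry = {
                "kind": kind,
                "label": el.get("TextLabel"),
                "type": el.get("Type"),
            }
            if kind == "Anchor":
                entry["on_interaction"] = owner
            index[graph_id] = entry
        for child in el:
            walk(child, owner)

    walk(root, None)
    return index


def kind_of(index, graph_ref):
    """Return the element kind a GraphRef resolves to, or 'unknown'."""
    return index.get(graph_ref, {"kind": "unknown"})["kind"]


index = index_graph_ids(SAMPLE_GPML)
```

In the toy file, the second interaction ends on "a1", which resolves to an Anchor on interaction "e1" rather than to a DataNode. Looking up an ID that is missing from your node table this way should tell you whether it belongs to a Group, a Label, an Anchor, or something else.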

These are not simple interaction networks; they are curated pathway diagrams. Biology is messy and so are the models :). Again, one goal of the RDF is to simplify access to interactions, but it is all encoded in the GPML ultimately, if not obviously.