Returning a "full ARG"
kitchensjn opened this issue · 11 comments
Would it be possible to have SINGER output the "full ARG" with marked recombination nodes (similar to that from msprime)? This would require maintaining unary nodes in the tree sequence, such as recombination nodes and coalescent nodes within other trees. In the below example, I've colored the recombination nodes yellow and the unary coalescent nodes red. Because more nodes are included in the msprime format, the ancestral node labels will most likely change during the conversion.
From our conversation earlier today, it seems like the information may be there already but it could take some work to get it in the correct format.
Hi @kitchensjn, I think I will start the discussion by saying that the way the SPGR is defined can be used to find these unary nodes. Suppose the cut happens exactly at a branch at its lower node, then the way the cut extends leftwards/rightwards tells us about the branch location and spatial extent of it. So the information is definitely sufficient. The only thing which might be of concern is how to pull this out efficiently, as it will be a much bigger graph.
Sorry, thought that I had responded! Thank you for looking into this. Please let me know if I can help in any way!
I have a tentative plan, and maybe you can help me check it it could work:
(1) Adjacent trees will be separated by a SPR move, and the SPR move will involve a partial branch that is removed, and a new partial branch that is added;
(2) Some unary nodes will be on that removed partial branch, and some unary nodes will be on that added partial branch;
(3) Now I think I can provide to you the list of unary nodes on each SPR, both the removed and added partial branch;
(4) I think you can go from here to a "full ARG", if you know how the tables object are built and how they can be converted to tskit "tree sequence";
(5) the fundamental idea is that the edge tables denote when a branch start and end, and they should start at the SPR that adds them and end at the SPR that removes them.
Let me know if you think this could work, and if you prefer me to help more on the conversion!
Going off Figure S3 (if I'm following your steps), that should work. I can definitely take a stab at that and see if I run into any snags. The trickiest part then seems like it will be keeping track of the edge spans for partial edges as more SPR moves are added, but we will at least have the connections. Sounds like a good starting point. Thank you!
Hi @kitchensjn let's maybe first discuss the conceptual implementation first. So here is my proposed solution:
(a) would be the information I provided for now out of SINGER. So a bunch of marginal trees and its associative SPRs.
(b) Now I will extend the coalescence nodes. Note that 0, 1, 2, 3 are leaf nodes so forget about them. 4 and 8 persist in all 3 trees so forget about them too. For example, 7 can be extended rightward, and 5 can be extended both ways.
(c) next I will extend the recombination nodes x and y. The procedure is the same as in (b).
The extension rules are in Fig S7. Let me know if you agree with them, and then we can think about how to best implement this in a program.
One note is that tskit represents recombination nodes with two IDs, associated left versus right parental edge. This could be potentially solved during the extension phase where trees to the left of the SPR are given the node label y1 and trees to the left are y2 (I believe that captures it but may be missing something). So trees 1 and 2 would have y1 and tree 3 would have y2 to show that the parent is different. This could also be incorporated during the tskit conversion phase if that is not possible. The tskit-dev group may have opinions on this and whether it is important. The output of part 3 otherwise looks perfect!
Hi @kitchensjn thanks for the insight, you are right about the tskit format. I am curious whether that "splitting" features matters for your project or not. If not, I am sure the conversion to tskit should succeed even without the "splitting". The only concern I had is that your programs will be dependent on this subtlety. Could you let me know how your program works and comment on do you need that feature? Or just a generic graph structure should work for you?
One note is that tskit represents recombination nodes with two IDs, associated left versus right parental edge
I think this might be an msprime oddity rather than a tskit thing -- that is, I don't think there's anything incorrect about having a single unary node for the recomb event. Would be good to get Jerome's input though.
^^^ Yeah Nate's points exactly. I am very confident about the conversion without worrying about the recombination ID. It is just whether it is important for James' project.
Could you let me know how your program works and comment on do you need that feature? Or just a generic graph structure should work for you?
Our code currently assumes that we are using a two ID format since we've solely been working from msprime/SLiM outputs, though this should not be difficult to change (I'll just add an extra "format" parameter). So I'm all for providing the generic graph it that makes it easier, and then we can make the necessary modifications on our end.
The only structures that aren't generally preserved in that generic graph format are "diamonds" due to how the tskit edge table is organized. That is a subtle case and may not matter here.
Just to jump in here: the 2-node version is indeed an msprime oddity (see e.g. the margin-note at https://tskit.dev/tutorials/args.html#visualization), but is required for msprime to be able to calculate the full ARG likelihood. The explanation is at tskit-dev/msprime#1942 (comment).
We should have a built-in way to convert from a 2-RE-node to a 1-RE-node representation. If we want to be able to convert back again, we would need to store some extra node metadata, however.
By the way, if we stored separate edge and interval tables (rather than a single table with both parent/child and left/right data), then we wouldn't have this problem, because we could group multiple intervals into one edge from A->B and another set of intervals into a separate edge, also from A->B. This also suggests that representing an ARG as a multidigraph which can also have multiple intervals on each edge is probably the best way to think of storing the graph structure.