jeromekelleher/sc2ts-paper

Issues with backbone phylo section

Closed this issue · 8 comments

  • We leave a big questions about why we have recombination in the reduced trees at all hanging. We need to at least hypothesize what's going on, even if just to say more work is needed to figure it out.
  • In retrospect the left-most tree is a bad choice, as this is likely to be enriched with artefacts. It only covers 0--64, where we have no sequence data at all, so the first break is almost certainly rubbish. Maybe we should (arbirtrarily) choose the tree in the middle of the genome instead?

Good points. Re point 2, the middle (position 14952) seems good to me. Attached is the PDF for that position, which seems fine.

Re point 1, jeromekelleher/sc2ts#127 is relevant. There is code to simplify the cophylogeny to include only those lineages involved in the recombinations, which is in a minor branch of my sc2ts repo:

https://github.com/hyanwong/sc2ts/blob/cophylogeny/notebooks/Cophylo.ipynb

The recombinations leading to the different trees looks like this:

image

cophylogeny_wide.pdf

Here's something I'm not understanding. I can make a list of all the samples that have identical paths in all trees to the root (i.e. are not affected by recombination), and then simplify the ARG to remove those as samples (while still keeping unary nodes). This results in a "recombinant_only" tree sequence with the same number of trees as the original but far fewer nodes.

Now I simplify the ARGs to remove unary nodes. But if I simplify the recombinant_only ARG, I get fewer trees than when I simplify the original ARG. I don't understand this:

def ancestors(self, u):
    """
    Returns an iterator over the ancestors of u in this tree.
    """
    u = self.parent(u)
    while u != -1:
         yield u
         u = self.parent(u)

recomb_affected = set()
first_tree = sc2ts_its.first()
parents = {u: set(ancestors(first_tree, u)) for u in first_tree.samples()}
for tree in sc2ts_its.trees():
    for u in tree.samples():
        s = set(ancestors(tree, u))
        if s != parents[u]:
            recomb_affected.add(u)

recomb_only_ts = sc2ts_its.simplify([u for u in recomb_affected], keep_unary=True)
print("backbone ARG =", sc2ts_its.num_trees, "trees. Recombinant-only =",
      recomb_only_ts.num_trees, "trees")
print("Simplified: backbone =", sc2ts_its.simplify().num_trees, "trees."
      "Recombinant-only =", recomb_only_ts.simplify().num_trees, "trees")

Gives

backbone ARG = 8 trees. Recombinant-only = 8 trees
Simplified: backbone = 6 trees.Recombinant-only = 3 trees

Here's something I'm not understanding.

OK, I think I get it now. There is a relationship between a recombinant node and a set of non-recombinant nodes (say a trichotomy). The recombination event changes this relationship. Even when simplified, the topological difference means the breakpoint is left in. But if the non-recombinant nodes are removed, then the simplified topology is the same between the trees either side of the breakpoint, so the 2 trees are collapsed into one. Here's an example drawn out on the board (red is the recombinant). If we remove the two (unrecombining) black samples, the relationship between red and green will be given by a single tree.

IMG_2953

Here's the actual edge diff for the first breakpoint that is seen in the simplified backbone tree but not in the simplified recombinant-only tree (magenta marks the samples that descend from a recombination node):
Screenshot 2023-05-10 at 17 01 24Screenshot 2023-05-10 at 17 01 34

Here's the "full" backbone ARG - edges in coloured cyan, edges out coloured red. This might be better as a graph plot. Possibly worth discussing on a big screen.

image

Thanks - this has always confused me. Can we make a graph plot?

I had to hand-modify it to get a half-decent graph plot:

nxstr-rec.pdf

I think we can close this now?