lutteropp/NetRAX

A short observation about per-displayed-tree-clvs


Naively, one would store number_of_network_nodes * number_of_displayed_trees CLV vectors. However, there are regions in the network that are exactly the same across multiple displayed trees. Those regions can be identified with the blob optimization trick.

---> One can actually save quite a bit of memory and computation here if one cleverly reuses CLVs / shares them among multiple displayed trees. It will be tricky to make this work together with dynamic changes (network moves) to the network data structure.
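For a sense of scale, here is a back-of-the-envelope sketch of the naive count (the example numbers are made up for illustration; a network with r reticulations displays up to 2^r trees):

```cpp
#include <cstddef>
#include <cstdio>

int main() {
    // Example numbers (assumptions for illustration, not from the issue):
    std::size_t num_network_nodes = 50;
    std::size_t num_reticulations = 4;
    // A network with r reticulations displays up to 2^r trees:
    std::size_t num_displayed_trees = std::size_t{1} << num_reticulations; // 16
    // Naive storage: one CLV vector per (node, displayed tree) pair.
    std::size_t naive_clv_count = num_network_nodes * num_displayed_trees; // 800
    std::printf("naive CLV vectors to store: %zu\n", naive_clv_count);
}
```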

Lol. Turns out blob optimization is not the way to go with the new likelihood definition we are using. Instead, each node in the network needs to store 2^(number_of_reticulations_in_subnetwork_rooted_at_the_node) CLV vectors. I am currently working out the details (e.g., having subnetwork-specific tree_ids that encode which reticulations were taken, and combining them via bitshifting; see the sketch below), but the potential speedup looks promising :sunglasses:
And now that I have a working naive per-displayed-tree-CLVs implementation to refine (one that stores 16 CLVs at each node in the example below), it's actually not that big of a coding effort once I had figured out the details on paper! :nerd_face: (The annoying thing remains redirecting all these CLV-vector pointers from the pll_treeinfo_t object all the time.)
(I am sick right now :face_with_thermometer:... but I can still do research while sneezing all day :test_tube:)
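A minimal sketch of how such tree_ids could be combined via bitshifting (all names here are assumptions, not the actual NetRAX code, and this ignores the complication that a reticulation may sit below both children at once):

```cpp
#include <cstddef>
#include <cstdint>

// Each reticulation below a node contributes one bit to the subnetwork's
// tree_id: 0 = first parent taken, 1 = second parent taken. A node whose
// subnetwork contains r reticulations therefore stores 2^r CLV vectors,
// indexed by tree_id.
using TreeId = std::uint64_t; // supports up to 64 reticulations

// Combine the tree_ids of the left and right child subnetworks into the
// parent's tree_id: shift the right-hand bits past the left-hand ones.
TreeId combineTreeIds(TreeId left_id, TreeId right_id,
                      std::size_t num_reticulations_left) {
    return left_id | (right_id << num_reticulations_left);
}

// Number of CLV vectors stored at a node.
std::size_t numStoredCLVs(std::size_t num_reticulations_in_subnetwork) {
    return std::size_t{1} << num_reticulations_in_subnetwork;
}
```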

Okay, if we had memory usage issues and needed some tunable in-between solution, we could make do with multiple CLV vectors only at the megablob roots, recomputing all the other CLV vectors within a megablob on the fly while iterating over its trees. Then it would make sense to use a gray-code iteration order to minimize the number of CLV recomputations within a megablob. But with the low numbers of reticulations we are currently talking about, a still rather low number of taxa, and no insanely large MSAs, memory issues are currently not our priority.
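For illustration, a small sketch of the gray-code iteration idea (the helper names are made up):

```cpp
#include <cstddef>
#include <cstdint>

// Consecutive gray codes differ in exactly one bit, i.e. exactly one
// reticulation switches parents between iteration steps, so only the CLVs
// above that reticulation would need to be recomputed within a megablob.
std::uint64_t grayCode(std::uint64_t i) { return i ^ (i >> 1); }

// Index of the single reticulation whose choice flips between iteration
// steps i-1 and i (requires i >= 1).
std::size_t flippedReticulation(std::uint64_t i) {
    std::uint64_t diff = grayCode(i) ^ grayCode(i - 1); // exactly one bit set
    std::size_t pos = 0;
    while (!(diff & 1u)) { diff >>= 1; ++pos; }
    return pos;
}
```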

[figure: example network, with 16 CLVs stored at each node]

Maybe we can even reduce the number of stored CLVs a bit more... by ignoring unreachable/dead areas (that is, areas where some inner node suddenly has no active children, due to how the active reticulations were chosen).
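A small sketch of what such a dead-area check could look like (types and names are illustrative assumptions, not the actual NetRAX code; a real implementation would memoize per configuration instead of recursing naively):

```cpp
#include <vector>

struct Node;

struct ChildEdge {
    Node* child;
    bool is_reticulation_edge;
    bool active; // for reticulation edges: chosen in the current config?
};

struct Node {
    std::vector<ChildEdge> child_edges; // empty for leaves
};

// For a fixed choice of active reticulation parents, an inner node is dead
// if it has no active, alive child left.
bool isDead(const Node& node) {
    if (node.child_edges.empty()) return false; // leaves are always alive
    for (const ChildEdge& e : node.child_edges) {
        bool edge_active = !e.is_reticulation_edge || e.active;
        if (edge_active && !isDead(*e.child)) return false; // alive child found
    }
    return true; // every child is inactive or dead -> node is unreachable
}
```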

But even if memory usage ends up being an issue in the future, the blobs per se do not make much sense anymore.
We could simply pick some arbitrary nodes on the fly for which we don't keep all their CLVs in memory, but rather recompute them on demand...

RIP blobs and gray-code stuff... well, they still make sense for the "wrong" likelihood definition. Maybe we can use the wrong one as a heuristic during horizontal search, to speed things up a little...

Even more CLV-saving potential, but not worth the extra effort and the dirty case distinctions needed:
[figure: sketch of the additional CLV-saving idea]

Okay, actually this extra CLV/computation saving is easier to implement than I thought. I'm confident that I can do it without much extra effort 🙂
[figure: refined sketch of the CLV-saving scheme]

In order to avoid pointer craziness, I will modify pll_update_partials in my libpll2 fork: instead of giving it a long list of operations, I will always pass just a single pll_operation_t. And I will specify which CLV vectors and scale buffers to use (for parent, left, and right) via the function call, thus entirely avoiding the use of partition->clv and partition->scale_buffer.

I will also modify pll_compute_root_loglikelihood accordingly.
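A rough sketch of what the modified interface could look like (the function names, parameter lists, and header path are assumptions; the actual fork may differ):

```cpp
#include "pll.h" // libpll header (path may differ in the libpll2 fork)

// Hypothetical single-operation variant of pll_update_partials: the CLV
// vectors and scale buffers for parent, left, and right are passed in
// explicitly, so partition->clv and partition->scale_buffer are never
// touched.
extern "C" void pll_update_partials_single(
    pll_partition_t * partition,
    const pll_operation_t * op,      // a single operation, not an array
    double * parent_clv,
    unsigned int * parent_scaler,
    const double * left_clv,
    const unsigned int * left_scaler,
    const double * right_clv,
    const unsigned int * right_scaler);

// Analogous hypothetical variant of pll_compute_root_loglikelihood that
// evaluates a caller-supplied root CLV and scaler instead of indices into
// the partition.
extern "C" double pll_compute_root_loglikelihood_single(
    pll_partition_t * partition,
    const double * root_clv,
    const unsigned int * root_scaler,
    const unsigned int * freqs_indices,
    double * persite_lnl);
```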

The speedup potential of this is HUGE, as it will give us back true incremental loglikelihood computation in networks!!! :-)

(Also, I already have an idea for a very ad-hoc pseudolikelihood function that's quick to compute and would then be easy to integrate -> the one I wanted to try from the beginning... It apparently makes no biological sense, but maybe it serves as a good speedup heuristic for identifying/pre-scoring promising move candidates.)

Topologies like this one complicate things, due not only to reticulations, but also to dead nodes. I have annotated the reticulation configurations for the displayed trees that need to be stored at each node in the picture on the right:
[figure: hand-drawn note (07 03 2021): example topology with dead nodes, the stored reticulation configurations annotated on the right]

Luckily, I have already figured out a solution.

  • Solution for processNodeTwoChildren: for each left tree, we need to check under which conditions (if any) it stands alone; likewise for each right tree. And a tree combined from a left and a right tree can itself be valid under multiple reticulation configurations...
  • Also, each displayed tree stores multiple reticulation configurations. A new ReticulationConfigSet data structure makes sense (see the sketch below).
  • This also changes (simplifies?) the handling of reticulation children, as having an inactive reticulation child or a dead-node child is essentially the same thing.
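A hypothetical sketch of such a ReticulationConfigSet (representation and helper names are assumptions, not the actual NetRAX code):

```cpp
#include <cstddef>
#include <optional>
#include <vector>

// Each reticulation is either unconstrained or forced to one of its two
// parents; a displayed tree stored at a node keeps the set of
// configurations under which it occurs.
enum class ReticulationState { DONT_CARE, TAKE_FIRST_PARENT, TAKE_SECOND_PARENT };

// One configuration: a state per reticulation in the network.
using ReticulationConfig = std::vector<ReticulationState>;

struct ReticulationConfigSet {
    std::vector<ReticulationConfig> configs;
};

// Intersect two configurations, e.g. when combining a left tree and a right
// tree in processNodeTwoChildren: fails if some reticulation would have to
// take different parents on the two sides.
std::optional<ReticulationConfig> intersect(const ReticulationConfig& a,
                                            const ReticulationConfig& b) {
    ReticulationConfig result(a.size(), ReticulationState::DONT_CARE);
    for (std::size_t i = 0; i < a.size(); ++i) {
        if (a[i] == ReticulationState::DONT_CARE) {
            result[i] = b[i];
        } else if (b[i] == ReticulationState::DONT_CARE || a[i] == b[i]) {
            result[i] = a[i];
        } else {
            return std::nullopt; // incompatible parent choices for reticulation i
        }
    }
    return result;
}
```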

Done! It works nicely and is ready to be integrated into the master branch. 😎