BioJulia/Phylogenies.jl

Unifying Phylogenetics

Opened this issue ยท 30 comments

This continues from BioJulia/Bio.jl issue number 263

@Ward9250 @jangevaare @cecileane @richardreeve

I still think it would be excellent to have a uniform framework for phylogenies in Julia. Currently there are (at least) 4 different packages for phylogenies, with different implementations and different advantages (Phylo has good import facilities and a newick parser, PhyloNetworks allows for reticulated phylogenies and is quite featured, PhyloTrees has plotting recipes and is lightweight, Phylogenies is based on LightGraphs which is a really cutting-edge graph library).

Having four different packages is not a good situation for the ecosystem. Of course individual users can just choose which package they like and forget the others, but it is hard for downstream libraries to incorporate phylogenies - two packages choosing different phylogeny libraries would become incompatible etc.

The easiest and clearest solution to this, IMHO, would be to define a PhyloBase package, much like StatsBase, which defined an AbstractPhylogeny type and stub functions (isultrametric(x...) = error("This phylogeny library has not implemented the isultrametric function")) for all of the functions that one can agree must form the basic interface to phylogenies. If all four of you could agree on depending on that package (and reexporting it), and implement the shared interface functions in terms of your syntax and functions in each of your packages, downstream users and packages could just depend on AbstractPhylogeny too, program using the shared interface and let the user choose which phylogeny package to use.

This is an extension of Julia's concept of informal interfaces (https://docs.julialang.org/en/latest/manual/interfaces/#Interfaces-1) and the model of many of the must successful ecosystems. Phylogenies.jl already has the beginnings of such an abstract interface, which could be moved out.

Would you be interested in doing this? I don't think it would limit each of your abilities to develop your packages freely. I wouldn't mind helping with some of the grunt work involved.

I currently have implemented something approximating such a solution in richardreeve/Phylo.jl - there is a public interface in src/Interface.jl and a private API that has to be implemented by any new subclass of:

abstract type AbstractNode end
abstract type AbstractBranch end
abstract type AbstractTree{NodeLabel, BranchLabel} end
abstract type AbstractInfo end

in src/API.jl. This was originally a separate package called AbstractPhylo.jl to unify the interface with PhyloTrees, but ultimately Justin wasn't interested. For my tastes it's much too complicated, but at the time it just built on the existing PhyloTrees interface. Anyway, generally I agree some consistency would be great I guess, though I'd happily delete mine given the chance if other packages implemented what I needed!

I'd still love to see a single minimal, core Bio.jl owned Phylogenetic tree/network package that was extended based on our needs/research into our own maintained packages. The fact that we haven't been able to make progress on such a goal may suggest something like that is too idealistic though...

If we all wish to continue doing our own thing (perhaps, in the interest of our own research, and not so much in the interest of success of phylogenetics in Julia in general...), I do think at the very least we should do something as @mkborregaard suggests. An interface to the several current packages, ร  la Plots.jl would really help. We can continue to do our own things, however we want to do them, but we shield the average users from the pedantry.

Hi all, I think this is a great idea, I'm already aware of some current attempts to do this in @richardreeve's packages, I'd be very happy for BioJulia/Phylogenies.jl to be said interface provider.

I made noises about this quite a while ago, I should apologise to you guys for not doing more, about it for phylogenetics, my evolution related programming this year hasn't revolved around trees though but on pop-genetics on sequences.

But I'm a fan of BioJulia taking an interface oriented approach to all the packages: on a branch of my BioSequences.jl fork I've been renaming types and refining the API to be much more "trait/interface-ey".

For the record, if there was to be this single idealistic package, I do think lightgraphs.jl based is probably in our best interests. Phylographs.jl anyone?

Obligatory:
xkcd: standards

To avoid the problem highlighted in the xckd, I think the interface of Phylogenies.jl like the good interfaces of Base, should concern themselves with the behaviour of phylogenies, rather than the specific internal representation of the data. Moving to a trait (or Concepts if you like C++), based interface. That should then place it in a good position to allow people freedom in their exact implementations, whilst brining order to any excess chaos. Much how the internal representation of the substitution models were different, but high level behaviour is consistent.

Yes - I was definitely suggesting something simple (i.e no AbstractNode etc) and implementation-independent. In that sense these interfaces are not standards, it's a common shared behaviour that packages can implement in addition to particulars. One place to start would just be the union (EDIT: I mean intersection) of behaviours (i.e. functions) that are already defined (possibly under different names) in all 4 packages.

That's fine - I agree, in fact - and the AbstractNode, etc. certainly isn't compulsory, but but it's worth pointing out that at the moment following at least parts of that API (addbranch!, addnode!, primarily) will give any phylogenetics package a newick parser (with some minor modifications), and the ability to translate to and from the phylo class in R.

My only concern about using the BioJulia/Phylogenies.jl interface for everything, is that I'm not sure what that is... @Ward9250 do you just mean all of the functions currently exported by the package? I think realistically there would have to be a serious discussion about what the interface ought to look like.

At the moment, what I've done is pretty rubbish, but I would like to be able to use the final interface to copy trees freely between implementations ideally, and certainly to write code that works with anything that implements the interface, so that tools like the parser can be written exactly once.

I very much like the idea of unifying the landscape with a "trait"-based approach: to allow different packages their own internal structure. Different implementations work best for different goals --for example the internal structure in PhyloNetworks was meant for easy transformation of networks, like prune-and-regraft etc. A couple notes about import/export between packages:

  • if each package has a robust newick parser (like Phylo and PhyloNetworks at least), then we can use the newick string to easily export/import trees between different package implementations.
  • PhyloNetworks has an export function (sexp) to export a phylogeny into ape's format in R (phylo class for trees, evonet class for networks). example below.
julia> using PhyloNetworks

julia> using RCall

julia> net1 = readTopology("(((A:4.0,(B)#H1:1.1::0.9):0.5,(C:0.6,#H1):1.0):3.0,D:5.0);")
PhyloNetworks.HybridNetwork, Rooted Network
9 edges
9 nodes: 4 tips, 1 hybrid nodes, 4 internal tree nodes.
tip labels: A, B, C, D
(((A:4.0,(B)#H1:1.1::0.9):0.5,(C:0.6,#H1):1.0):3.0,D:5.0);


R> $net1
$edge
     [,1] [,2]
[1,]    5    6
[2,]    5    1
[3,]    6    8
[4,]    6    7
[5,]    7    2
[6,]    8    4
[7,]    8    9
[8,]    9    3

$Nnode
[1] 5

$edge.length
[1] 3.0 5.0 0.5 1.0 0.6 4.0 1.1  NA

$reticulation
     [,1] [,2]
[1,]    7    9

$tip.label
[1] "D" "C" "B" "A"

attr(,"class")
[1] "evonet" "phylo" 

R> library(ape)

R> $net1

    Evolutionary network with 1 reticulation

               --- Base tree ---
Phylogenetic tree with 4 tips and 5 internal nodes.

Tip labels:
[1] "D" "C" "B" "A"

Rooted; includes branch lengths.

To move forward, we would need to list of traits that each package should have, ideally. To get it started: isultrametric, tiplabels, nleaves, nedges etc., something for the phylogenetic distance between tips (or between all nodes). I am afraid that generic AbstractNode and the like would get too much into the detailed implementations. Even tree transformations like nni (or nearestNeighborInterchange) could be difficult to standardize, because of the different ways to indicate an edge around which the nni should be done. If there is agreement about an AbstractPhylogeny type and stub/trait functions, I would gladly jump in to help make this happen.

A very useful feature would be some standard to unite a phylogeny with data.

For statistical models, "fitted" models contain information about both the model and the data, so we can ask questions like AIC(model) or loglikelihood(model). Here, models include a phylogeny and some parameters, and the data might be molecular sequences and/or traits.

Having a structure for phylogeny + data has been a struggle for R folks, with a standard that was developed but not really used. @bomeara would know better than me. The difficulty is that data comes in data frames, with taxa in rows say, but taxa may be ordered differently in a phylogeny. Another difficulty is when we want to attach character states along edges, with possible shifts and multiple states along a single edge. If we could learn from what works in R, and set a standard now before each package has its own implementation, that would be great.

@cecileane After going back and fourth a bit with this myself, I take a similar approach and use Dicts for node data with PhyloTrees. I keep as a separate object from the tree itself (at a time I didn't, however). I do think this is more suitable than a dataframe, but I understand it is not as integrated as what some are looking for. In fact at this time, I have no special code included in PhyloTrees.jl for managing node or branch data (branch length notwithstanding). I've been considering implementing a special Dict for nodes and branches, with an API that was more accessible for this application...

Thanks @jangevaare. Do you handle missing data as missing keys in Dicts?

To provide a wrapper, is there a standard function to take a data frame, a phylogeny, a formula perhaps (like y~x to indicate that y is the response, x is a predictor), and returns a dictionary for each necessary trait, with keys restricted to tips in the phylogeny? I could use such a function for what I am doing right now.

For our work on trait evolution (in PhyloNetworks), we built on tools from the GLM package, which uses data frames for input data. So a standard for data frames could be useful, still.

@cecileane Yep! For the key I use a Int64 node id, which corresponds to my definition of Trees

struct Tree
  nodes::Dict{Int64, Node}
  branches::Dict{Int64, Branch}
  height::Float64

  Tree() = new(Dict{Int64, Node}(), Dict{Int64, Branch}(), 0.0)
  Tree(x::Float64) = new(Dict{Int64, Node}(), Dict{Int64, Branch}(), x)
end

Adding an extra key look up even though it is very fast to do, I'm not 100% sure is the best approach. It is very simple from a developer and maintainers POV. This is where I think @richardreeve and I disagreed (@richardreeve went on to define parametric nodes in his package IIRC).

To get back on topic:
Having Newick import-export functionality in all packages to allow for moving between them is an option (to start). Perhaps a general method for some kind of convert() function could exist which relies on this common functionality, and more specialized methods can be provided developer-willing... and work from there.

I currently don't have Newick string functionality, so I will look to all of your packages that currently do, and lift and modify for my own use license depending :)

Phylogeny.jl, which uses LightGraphs, could use MetaGraphs.jl for the phylogeny+data, that would be very efficient (but that's an implementation detail, not an interface detail so I guess that wouldn't need coordination among packages??). @Ward9250 given your suggestion to let your abstract types provide a basis for the abstract interface (and if people can get behind that), what would you need me to do to help you get that rolling? @cecileane just out of curiosity, where did you end up getting PhyloNetworks out? I still can't belive the Sys Bio editor rejected it on two utterly positive reviews.

Actually I think abstract types for nodes and edges might be very good for compatibility, even more flexible and powerful that a newick string based method of transforming from one representation to the other, if it is well designed.

Say there is a type of tree which <: from an AbstractTree(or Phylogeny), it has traits of say nodetype, and branchtype, (in the same way collections have eltype) and that package implements types for concrete nodes and branches that inherit from some abstract node and some abstract branch. Then traits could define what those nodes and branch types support (confidence levels, branch lengths, times) and so on. Well then suppose there is another type of tree <: AbstractTree with it's own node and branch types which adhere to the same abstract interface and traits - pretty generic conversion methods may be defined and through getters/setters, pruners and grafters of each tree, (some of which may also be generic to some extent based on traits and dispatch), then you should be able to go from one representation to the other with some caveats such as there being optional traits one representation does not support, but that can be caught and warned about. Such an interface also works without writing to an intermediary structure.

thanks for asking, @mkborregaard! PhyloNetworks went to MBE. The SB editor mentioned a preference for standalone packages as opposed to R packages, for instance --what can we do.

I completely agree that a newick-string-based approach would be inefficient, and a trait-based approach would extend possibilities to handle many things beyond what we can store in the newick string.

MBE's a nice outcome too, great. Yeah not wanting a julia package is 100% fair but then it would seem more natural to reject it editorially IMHO.

Yeah @Ward9250 I've been looking at the exported functions from all 4 packages today, and I agree it's nice and powerful with an AbstractNode/AbstractEdge -based function system as well, but will all implementations support it? I'm not sure PhyloNetworks will?
There are two ways, I think, to call a function on a node - either have the node as an object (e.g. isleaf(node::AbstractNode) or have the node as an index into a Phylogeny (the R way; e.g. isleaf(phy::AbstractPhylogeny, node::Int). Phylo, Phylogenies and PhyloTrees are compatible with the first way. Here's the rub - I don't think an implementation following the second method could become compatible with a method on AbstractNode. One solution could be to have isleaf defined on isleaf(phy::AbstractPhylogenyIterator, node::Int) which would work for both implementations but not very efficiently I'm afraid. Does this make sense?

PhyloNetworks too should work the first way: like isleaf(node::AbstractNode).

Oh cool, I missed that. Then I would appear that AbstractPhylogeny, AbstractNode, and maybe AbstractEdge and AbstractDataPhylogeny could be the types of a common interface for all? That's cool.

Hi all, I'm afraid I'm in Tanzania at the moment without great internet, but I definitely like the direction this is going. As I've said elsewhere, my main interest isn't phylogenetics, and I'm more than happy to remove my package if it causes trouble and something else covers what I need - I just created it to scratch my own itch because I couldn't get other packages to work for me. In particular, what I most needed was and a random tree generator, handling of at least tip traits, and possibly internal node and edge traits too, and then an R and newick interface turned out to be handy too.
Anyway, I have a few comments, some of which refer back a couple of days:

  • I think a fallback of interoperability through newick trees as @cecileane suggests is okay as a last resort, but definitely not what we should be aiming for. Too much information can be lost. And it also requires lots of separate implementations of parsers, which are hard to get right.
    • Mine, for instance, doesn't read in all of the metadata about nodes and edges (though it does know enough to ignore it).
    • It also only reads newick trees, not nexus, which is what I'd prefer, so I have to strip out the header information (and so often the names of the leaves!) before reading, and it can't read multiple trees in a file. I hope to fix these things, but since my tree class doesn't allow multiple trees in it (a la multiPhylo in ape), the latter will need considerable work.
    • I don't know what the situation is with PhyloNetworks for these issues?
  • Likewise my interoperability with R allows translation back and forth with phylo (but not multiPhylo) like PhyloNetworks I think, but is also a little hard to get right, and is the kind of thing it should be possible to provide in common through the right interface.
  • As far as allowing AIC(phylo)-type interfaces, that seems like a separate issue from the phylogenetics entirely, but rather an interface for all statistical models?
  • My concern with MetaGraphs.jl and similar approaches was how generic it was - I'd like some enforcing of (e.g.) not loops in the structure. Something that general would be useful to get ideas for the interface, though.

On more specific points about the interface:

  • I'm not a fan of enumerating nodes as 1-n rather than having arbitrary labels (which can be integers) - this means that if you delete nodes from the tree, you have to renumber them, and potentially all of the metadata. Even worse if you want to keep compatibility with phylo in R, then adding tips has the same effect as the number has to stick in the middle. I think that approach is archaic, and a complete break at this point would be good.
  • I'm definitely a fan of AbstractTree, AbstractNode and AbstractEdge/Branch, but I'm uncertain about how much information should be stored in them / accessible through them - I only have the labels of connecting branches for nodes, and the labels of connecting nodes and the branch length for branches. Metadata is stored separately.
    • This means that you can't necessarily get the metadata from a node though, as it won't have a reference to the metadata store.
  • I also like iterators over them to go through the contents of the tree. For instance, I like the fact that I can do branchiter(tree) and iterate over the branches, and nodefilter(tree, isleaf) and iterate over the leaf nodes. I don't know how the other packages do this?
  • Any <: AbstractTree implementation that didn't internally use some <: AbstractNode or <: AbstractBranch for storage internally could easily write wrappers that took whatever information we decided that these should provide and created an object on the fly.
  • @mkborregaard I'm not sure what the AbstractDataPhylogeny is for? Wouldn't we just have some null/missing response if you ask for metadata for a tree without one?

Last one, I promise! As far as the interface itself is concerned, we're talking above about the function signatures, which seems a bit premature. What is the actual functionality we need for interoperability? I think we need to be able to:

  • get [an iterator for] the nodes or nodenames of a tree
  • get [an iterator for] the tips or tipnames of a tree
  • get [an iterator for] the branches or branchnames of a tree
  • get a node or branch from a tree by its name
  • determine whether a node is:
    • internal, a root or a leaf
  • somehow create a tree
    • I do it by the ability to create a node, and the ability to create a branch between two nodes
    • or via the ability to create a node, a branch, and connect up nodes and branches?
  • delete a node (and possibly a branch, or maybe they die when their attached nodes die)
  • read and write metadata:
    • get the metadata associated with a node (or nodename?) from a tree (or a node?)
    • add metadata associated with a node (or nodename?) to a tree (or a node?)
    • do we need to make it possible to have different metadata for tips and internal nodes?
    • do we need to allow missing metadata (possibly as a solution to above), or is that implementation dependent?

Things that we might then be able to provide at a high level (not in individual packages):

  • read in a tree from a file and write it back
  • translate a tree to R and back
  • validate that a tree is legal

On things that perhaps don't belong:

  • Should we use the Sampleable abstract type and the rand() interface provided by Distributions.jl to create random trees, instead of inventing our own? See my implementation for instance.
  • Should we use some interface from JuliaStats (I presume? which?) to provide things like loglikelihood() and AIC()?

I'm sure there's lots more, but those are my thoughts for now.

I think the first thing to think about is a shared interface. Interoperability is also cool, but I think separate. Also, things like building a phylogeny on MetaGraphs is implementation (wrt loops in the structure PhyloNetworks has that - but you could enforce that by using the SimpleGraph from LightGraphs).

I do agree with getting an overview of shared functionality, and with the list @richardreeve provided above. I've gone through the different packages and tried to compile a list of all the implemented functions, to scope how much functionality is overlapping. I've compiled a draft into a worksheet, here https://docs.google.com/spreadsheets/d/1Y1MvYA5AMs6Fue7xFe8-ghq-oH7sWKXA3k666mmkIAM/edit?usp=sharing
Please note that I don't know all four packages super well so there may be lots of misplacements. Could you have a look? There's surprisingly little overlapping functionality.

Actually for what I wrote above, it turns out that Phylo, PhyloTrees and Phylogenies all have isleaf defined as isleaf(phy::AbstractPhylogeny, node::Int), not isleaf(node::AbstractNode) (only Phylo has that, and based on the commen above PhyloNetworks could get it). I think the latter is more generic, but it requires AbstractNode to know which Phylogeny it is part of, and I guess it doesn't with the other packages? (I do agree totally that a break with node enumeration would be nice, though again LightGraphs uses enumeration I think but has a way to deal with it that doesn't cause the outlined problems. Again, indexing into an iterator (depthfirst(phy)[3]) would be a way to get the best from both worlds.)

๐Ÿ’ฏ ๐Ÿ‘ for having different types of iterators for traversal, to use the Distributions interface for randomness and the StatsBase interface for statistical functions

@richardreeve when you talk about a wrapper, do you mean something like this?

module PhyloBase

abstract type AbstractNode end
isleaf(node::AbstractNode) = error("")
root(phy) = error("")::AbstractNode

end

module MyPhylogeny
import PhyloBase

struct BaseNode <: PhyloBase.AbstractNode
    phy::Phylogeny
    node::Int
end

myroot(phy) = 1
myisleaf(phy, node) = node <= ntips(phy)
PhyloBase.isleaf(node::BaseNode) = myisleaf(node.phy, node.node)
PhyloBase.root(phy) == BaseNode(phy, myroot(phy))
end

which keeps the shared interface and package interface separate (in that PhyloBase functions only accept and return Base and PhyloBase types)

regarding newick parsing (to answer your question @richardreeve), PhyloNetworks is also limited to plain newick trees, not nexus files (so taxon names need to be in the trees). We can read multiple trees (or networks) in the same file. Metadata about branch support is ignored. Other metadata could cause issues. So yes, I agree that newick parsing is tricky and is a last resort option for interoperability.

About interfaces like AIC(phylo), there is already this kind of interface in StatsBase.jl, with an abstract type StatisticalModel etc. What we could think about, before 14 different standards exist, are ways to combine a phylogeny with a data set. Standard statistical models don't have this issue: models might consist of an intercept, a slope and a variance parameters (to simplify things). The fitted model can easily store these numerical parameters and the original data together. On trees, we need to make sure that the tips in the tree match correctly with rows in the data, for instance. For models with shifts in parameters along the phylogeny, the model might need the position of a shift along a particular edge. I don't have a specific problem in mind, so my comments is not very useful here. But thinking ahead might help with the problem of unifying implementations for phylogeny+data.

There are ways of handling node data that don't have the issue of being invalidated if the topology changes, as you do if data is stored in a DataFrame where a node corresponds to a row (and thus has to be considered as part of a linear order, as described by richard above). E.g. a Phylogeny could store data as a Dict{Node, NamedTuple}, where the Node object itself is used as the key - this would be efficient and flexible.

@cecileane The inheritance from StatisticalModel to provide AIC(), etc. is inevitable I guess, but unfortunate for now, given we can't have multiple inheritance - however, that will hopefully be fixed in the medium term with a proper mechanism in Julia for interfaces. In the meanwhile, the two obvious solutions are:

  1. to mirror that interface without any formal connection to it, with the aim of adopting as and when possible, or
  2. to create a wrapper class which contains the tree and itself inherits from StatisticalModel

I don't know how that kind of information is stored in the tree - does it fit in the standard newick format, in which case 1. would presumably be best, or is it held outside somehow, in which case 2. would presumably be a better approach?

As far as storing associated data, I absolutely agree that some standardisation would be fantastic. For me, it's the main thing, and in fact the phylogeny is really the added data, with species abundances and locations of samples, etc. as core. There's no reason this can't work either way though. At the moment I fix the leaf order when I create a tree by using an OrderedDict to hold the nodes, and then reread the leaf names as I bring the tree into my Diversity package to check that worked and use that order for my other data (which is in some subtype of AbstractArray). However, the other thing I am strongly considering is to use categorical AxisArrays, which should ensure that no matter how stupid I am, everything remains aligned - they say this is very fast, and we're beginning to play with it in anger. That interferes with the idea of using a DataFrame, but I need my data in arrays so I can do fast computation on them - I don't find dataframes to be fast at all at the moment (I may be doing something wrong?). Some kind of Dict, as @mkborregaard suggests, would also be possible, but would also be slower I imagine.

nice work on tidytree in R, showing the need for a data structure combining a phylogeny with data long edges & nodes. also gives examples of getters: child(node), parent (or parents to include networks), offspring, ancestor, sibling (or siblings to allow for polytomies).