labgem/PPanGGOLiN

Phylogeny Question


Does anyone know whether it is better to build a phylogeny from the conservative core genes or from the soft core genes (persistent)? I know that using the conservative core genes is the standard. However, I feel that using the soft core (genes present in 95% of the genomes) might provide higher resolution because of the extra genes included, especially when an outgroup (a closely related species that helps root the tree) is added.

Hello,

Thank you for your question. I am not an expert in phylogeny, and I can only give you my personal opinion.

To begin with, I think you are right to prefer the persistent genome to the core genome, especially for large pangenomes. As the number of organisms in the pangenome increases, the core genome shrinks, and you can end up with a really tiny core genome. It may also be worth running both analyses and comparing the trees built from the persistent genome and from the core genome.
The parameters are important too, because they change the persistent genome and the core genome. Depending on your phylogeny, you can use different levels of identity and coverage.
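
To make the difference concrete, here is a minimal Python sketch on made-up data. It is not PPanGGOLiN code: the family names, genome names, and helper functions are all hypothetical, and the 95% cut-off is the simple frequency definition of the soft core from the question, whereas PPanGGOLiN assigns the persistent partition in its own partitioning step. It only shows how a single missing gene drops a family from the strict core, and how the strict core shrinks as genomes are added.

```python
# Toy presence/absence data: family name -> set of genomes carrying it.
# All names below are invented for illustration.
genomes = [f"g{i}" for i in range(20)]
presence = {
    "famA": set(genomes),                 # present in 20/20 genomes
    "famB": set(genomes) - {"g3"},        # present in 19/20 (95%)
    "famC": set(genomes) - {"g3", "g7"},  # present in 18/20 (90%)
    "famD": set(genomes[:5]),             # present in 5/20 (25%)
}


def families_at_least(presence, n_genomes, min_percent):
    """Return families found in at least `min_percent` % of the genomes."""
    # Integer arithmetic avoids floating-point surprises at the threshold.
    return sorted(
        fam
        for fam, carriers in presence.items()
        if len(carriers) * 100 >= min_percent * n_genomes
    )


def strict_core_size(presence, genome_subset):
    """Count families present in every genome of `genome_subset`."""
    subset = set(genome_subset)
    return sum(1 for carriers in presence.values() if subset <= carriers)


strict_core = families_at_least(presence, len(genomes), 100)
soft_core = families_at_least(presence, len(genomes), 95)

# Strict core size when only the first 3, 5, or all 20 genomes are used.
core_sizes = [strict_core_size(presence, genomes[:n]) for n in (3, 5, 20)]

print("strict core:", strict_core)
print("soft core (95%):", soft_core)
print("strict core size for 3, 5, 20 genomes:", core_sizes)
```

In this toy set the soft core keeps `famB`, which the strict core loses because one genome lacks it. With real data, the extra families are what would add alignment columns to the phylogeny.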

In PPanGGOLiN, by default, we remove the fragments and do the clustering with MMseqs2, where you can choose the identity and coverage parameters. If you prefer, you can also provide your own clustering. When you build the graph, you can remove gene families that are too duplicated in your genomes. Finally, when you partition the pangenome graph, you can set the number of partitions, which changes the core/persistent genome.

Again, I am not an expert in phylogeny, and I will follow and contribute to this discussion with pleasure. I hope I have helped you.

Thanks again.