/PoMo

Implementation of a polymorphism aware phylogenetic model using HYPHY

Primary LanguagePythonGNU General Public License v2.0GPL-2.0

A newer, reversible version of PoMo implemented in IQ-TREE is available. We highly recomment to use the new version; please check out the homepage of IQ-TREE or directly go to the PoMo development branch of the respective GitHub repository.

Further, the conversion scripts have also moved to a new, separate repository: counts file library.

Schrempf, D., Minh, B. Q., De Maio, N., von Haeseler, A., & Kosiol, C. (2015). Reversible Polmorphism-Aware Phylotenetic Models and their Application to Tree Inference (Manuscript in Preparation).

L.-T. Nguyen, H.A. Schmidt, A. von Haeseler, and B.Q. Minh (2015) IQ-TREE: A fast and effective stochastic algorithm for estimating maximum likelihood phylogenies. Mol. Biol. Evol., 32, 268-274. DOI: 10.1093/molbev/msu300

PoMo 1.1.0

Implementation of a polymorphism aware phylogenetic model using HYPHY.

Created by:

  • Nicola De Maio Contributors:
  • Dominik Schrempf
  • Carolin Kosiol

For a reference, please see and cite: De Maio, Schlotterer, Kosiol (MBE, 2013), and/or: De Maio, Schrempf and Kosiol (ms. in preparation). You can use this software for non-commercial purposes but please always acknowledge the authors.

Feel free to post any suggestions, doubts and bugs.

libPoMo

PoMo comes with a small python package libPoMo. It contains several modules that ease the handling of data files in fasta, variant call (vcf), or counts format (the file format PoMo works with).

For a detailed and comprehensive documentation visit libPoMo documentation.

However, it is highly probable, that you only need to consider the file conversion scripts to prepare counts files to be analyzed with PoMo.

Installation Requirements

Before installing, check the following requirements:

Installation

Download PoMo with:

git clone git://github.com/pomo-dev/PoMo

This will create a folder PoMo.

PoMo heavily relies on HyPhy. It comes with its own, patched version that can be found in the folder created above. To build this version of HyPhy, extract the HYPHY.tar.gz (this archive is part of the git repo; PoMo will not work if you use another version of HyPhy) and do

cd path_to_HYPHY/
cmake ./
make MP2

Running PoMo

To run PoMo, execute with python3

python path_to_PoMo/PoMo.py  path_to_HYPHY/HYPHYMP path_to_data/data.txt

The data file must either be in fasta format, or in counts format (see example files in ./sample-data/).

If your dataset is called "data.txt" the final output of PoMo will be called "data_PoMo_output.txt". It will contain the log-likelihood, the estimated parameters, and the estimated species tree. The output will be placed in the folder that you run PoMo from.

Command Line Arguments

PoMo supports various command line arguments. Run python PoMo.py --help or visit PoMo-help for a detailed help message about the different flags that are available.

File Conversion

Various scripts for file conversion are provided in the folder ./scripts/. Run python scriptXY.py --help for detailed help messages. These scripts use libPoMo that has to be in the python path.

Data

Sample data can be found in pomo-dev/data. This repository also includes the data sets of our manuscript that has been accepted to Systematic Biology. A proper citation will follow, for now please refer to BioRxiv.

Notes

PoMo works best if data about within-population variation is provided. So, in case you do only have a single sample per species, you need to provide PoMo with some meaningful estimate of theta. In any other case, you do not have this problem.

Virtual population size is 10. This is the only case we have extensively simulated. I will relax this constraint in the future, but for now, population samples larger than 10 are down-sampled to 10.

The model assumes that all data from the loci considered is provided: both variable sites, and sites that are fixed through species and populations. If you have a SNP dataset only and do not know the level of polymorphism, we do not recommend to use PoMo, because it infers the level of polymorphism during the maximization of the likelihood and will fail then. If you know have SNPs only but are aware of the level of polymorphism, we advice you for now to add the fixed fixed sites to you data in the right proportions. This will not slow down PoMo significantly but allow you to get meaningful estimations.

The publication that is out in MBE about PoMo discusses estimation of population parameters (mutation rates, fixation biases, etc) and not of species trees. Soon we are going to submit our manuscript regarding species tree estimation with PoMo. Here the focus is solely on estimating species trees. It should be easy for me to extend the software to perform both analyses simultaneously in the future.

The computational time should be feasible for up to few dozens of species, no matter of the sample sizes and number of loci.

PoMo is heavily based on HyPhy. You should probably acknowledge HyPhy's authors (e.g. Sergei L kosakovsky-Pond) if you use PoMo.

In the fasta format all individuals must be named as "species_n" where "species" is the name of the species and "n" is a number, it does not matter which number.