HYPNO - HYbrid Protein NucleOtide phylogenetic gene tree reconstruction ================================================================================ HYPNO is a collection of Python scripts that can improve fine-branching order (topology) of phylogenetic gene trees via clade reconstruction using nucleotide information to differentiate between nearly identical protein sequences. ================================================================================ HYPNO VERSION 1.0 SOFTWARE DEPENDENCIES ===================== Python 2.7 (http://www.python.org/getit/releases/2.7/) Biopython 1.6 (http://biopython.org/wiki/Download) EMBOSS Needle (http://www.ebi.ac.uk/Tools/psa/emboss_needle/) FastTree (http://www.microbesonline.org/fasttree/#Install) INSTALLATION ===== For dependency installation, please refer to the above URLs, which contain download and installation information for Windows, Mac, and Linux operating systems. Python, EMBOSS, and FastTree executables must be included in the system PATH environment variable. These may either be added by default during installation or manually added by the user afterwards. At this point, the user should be able to invoke the HYPNO software from the commandline through the following usage options. USAGE ===== $ python HYPNO.py --msa <msafile> -tree <treefile> [options...] Options: --k <PPWID> Minimum pairwise percent identity (PPWID) for determining subtree selection (default: 90.0). --n <PPPID> Minimum "predicted protein" percent identity (PPPID) between retrieved nucleotide sequence and expected sequence for retrieved sequence to be accepted (default: 95.0). --s <PCS> Minimum percent of correct sequences (PCS) required in an acceptable tree. This values depends on --n <PPPID>. Any value less than 100.0 may result in an output tree with fewer taxa than the input tree. (default: 100.0) --opl When specified, output tree undergoes midpoint rooting and branch lengths are calculated using the protein MSA while tree topology is kept constant. --oplnuc When specified, output tree undergoes midpoint rooting and branch lengths are calculated using the nucleotide MSA while tree topology is kept constant. EXAMPLE DATA ============ Example input files, “large_sample.afa” and “large_sample.nwk” (sample tree containing 120 leaves) as well as “small_sample.afa” and “small_sample.nwk” (sample tree containing only 17 leaves) are also provided in the HYPNO source code repository. Suggested usage for these datasets is as follows: python HYPNO.py --msa large_sample.afa --tree large_sample.nwk --k 80 --n 80 –-opl python HYPNO.py --msa small_sample.afa --tree small_sample.nwk --k 70 --n 80 --opl For a description of how to interpret which subtrees have been refined by HYPNO, please refer to the HYPNO FAQ at http://phylogenomics.berkeley.edu/HYPNO/help/. OUTPUTS ======= For an input MSA file 'foo.msa', the program outputs: foo.hypno.msa This file is the nucleotide MSA in Aligned FASTA format. foo.hypno.tree This file contains the re-estimated tree topology in Newick format. Also, a timestamped folder containing intermediate and debug file information is created. EXTENDED USAGE ============== The script is invoked from the command line and the user must specify the path to the multiple sequence alignment file (MSA-file), encoded in FASTA or UCSC a2m format along with the original protein tree (tree-file) in Newick format. Both files must contain valid Uniprot accession for each sequence/leaf that are consistent between files. Additional options may be specified to refine subtree definition and nucleotide and tree acceptance criteria. $ python HYPNO.py --msa <msafile> -tree <treefile> [options...] Options: --msa <msafile> Input Multiple Sequence Alignment: the user must provide this argument followed by the path to an alignment of amino acid sequences, with UniProt accessions included in the sequence headers. HYPNO will retrieve the corresponding nucleic acid sequences and use the provided alignment as a template for constructing a nucleotide MSA. --tree <treefile> Tree re-estimation: specifying this argument along with the path of the input tree results in re-estimation of tree topology based on HYPNO retrieved nucleotide sequences. If this argument is not provided, HYPNO builds and outputs the nucleotide MSA without generating a tree. --k <PPWID> Subtree selection: Minimum pairwise percent identity (PPWID) for determining subtree grouping (default: 90.0). Adjusting this parameter may alter the size of subtrees for which topology is reevaluated. Very large values may result in no subtrees and no change in tree topology while very small values may result in overly large subtrees and coarse taxa grouping. To override the default value of 90 percent, type “--k <X>”, where X is a real number between 0 and 100. For example, to set the subtree selection value to 93, one should invoke “--k 93.0”. --n <PPPID> PercentID match: Minimum "predicted protein" percent identity (PPPID) between retrieved nucleotide sequence and expected sequence for retrieved sequence to be accepted (default: 95.0). This parameter determines acceptance criteria for retrieved nucleotide sequences. retrieved nucleotide sequences are translated to protein using appropriate codon translation tables. If the percent identity between the translated retrieved sequence and the original protein sequence falls below the threshold, the nucleotide sequence is not considered representative of the protein sequence and is discarded from the alignment and subsequently the tree. To override the default value of 95 percent, type “--n <X>”, where X is a real number between 0 and 100. For example, to set the percentID match value to 80, one should invoke “--n 80.0”. --s <PCS> Retrieval minimum: Minimum percent of correct sequences (PCS) required in an acceptable tree. This values depends on --n <PPPID>. Any value less than 100.0 may result in an output tree with fewer taxa than the input tree. (default: 100.0) This parameter determines what fraction of sequences can be discarded from the alignment and subsequently the tree as a result of the unavailability of sequences or poor identity between protein sequence and retrieved nucleotide sequence. Setting the PPPID and PCS value very high (overly stringent) may result in the production of no output tree. --opl Branch length optimization using protein MSA: When specified, output tree undergoes midpoint rooting and branch lengths are calculated using the given protein sequences while the tree topology is kept constant. The resultant tree with rooting and optimized branch lengths will be located in 'foo.opl.hypno.tree' (where 'foo' is the input MSA filename). When optimization is not invoked, the HYPNO output contains only a "topology tree" for which branch lengths have not been recalculated. --oplnuc Branch length optimization using nucleotide MSA: When specified, output tree undergoes midpoint rooting and branch lengths are calculated using the retrieved nucleotide sequences while the tree topology is kept constant. The resultant tree with rooting and optimized branch lengths will be located in 'foo.opl.hypno.tree' (where 'foo' is the input MSA filename). When optimization is not invoked, the HYPNO output contains only a "topology tree" for which branch lengths have not been recalculated.