HYPNO - HYbrid Protein NucleOtide phylogenetic gene tree reconstruction
================================================================================

HYPNO is a collection of Python scripts that can improve fine-branching order
(topology) of phylogenetic gene trees via clade reconstruction using nucleotide
information to differentiate between nearly identical protein sequences.

================================================================================


HYPNO VERSION 1.0 SOFTWARE DEPENDENCIES
=====================

Python 2.7      (http://www.python.org/getit/releases/2.7/)
Biopython 1.6   (http://biopython.org/wiki/Download)
EMBOSS Needle   (http://www.ebi.ac.uk/Tools/psa/emboss_needle/)
FastTree        (http://www.microbesonline.org/fasttree/#Install)

INSTALLATION
=====

To install the dependencies, please refer to the URLs above, which provide download and
installation instructions for Windows, Mac, and Linux operating systems. The Python, EMBOSS,
and FastTree executables must be included in the system PATH environment variable. They
may be added automatically during installation or added manually by the user afterwards.
Once the dependencies are in place, HYPNO can be invoked from the command line
using the usage options below.
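
As a quick sanity check before the first run, the sketch below reports which
executables cannot be found on the PATH. The executable names "needle" and
"FastTree" are assumptions based on the names these packages commonly install
under, and the helper name find_missing_tools is hypothetical. The check uses
shutil.which, which requires Python 3.3 or later, so it is meant to be run
separately from HYPNO itself.

```python
import shutil

# Assumed executable names; adjust them if your installation differs
# (for example "fasttree" or "FastTree.exe").
REQUIRED_TOOLS = ("python", "needle", "FastTree")


def find_missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the names in `tools` that cannot be located on the PATH."""
    return [name for name in tools if which(name) is None]


if __name__ == "__main__":
    missing = find_missing_tools()
    if missing:
        print("Not found on PATH: " + ", ".join(missing))
    else:
        print("All required executables were found on PATH.")
```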

USAGE
=====

$ python HYPNO.py --msa <msafile> --tree <treefile> [options...]

Options:
    --k <PPWID>         Minimum pairwise percent identity (PPWID) for 
                        determining subtree selection (default: 90.0).

    --n <PPPID>         Minimum "predicted protein" percent identity (PPPID) 
                        between retrieved nucleotide sequence and expected
                        sequence for retrieved sequence to be accepted 
                        (default: 95.0).

    --s <PCS>           Minimum percent of correct sequences (PCS) required in 
                        an acceptable tree. This value depends on --n <PPPID>. 
                        Any value less than 100.0 may result in an output tree 
                        with fewer taxa than the input tree (default: 100.0).

    --opl               When specified, output tree undergoes midpoint rooting 
                        and branch lengths are calculated using the protein MSA
                        while tree topology is kept constant.

    --oplnuc            When specified, output tree undergoes midpoint rooting 
                        and branch lengths are calculated using the nucleotide MSA
                        while tree topology is kept constant.

EXAMPLE DATA
============

Example input files, “large_sample.afa” and “large_sample.nwk” (sample tree containing 
120 leaves) as well as “small_sample.afa” and “small_sample.nwk” (sample tree containing 
only 17 leaves) are also provided in the HYPNO source code repository. Suggested usage 
for these datasets is as follows:

    python HYPNO.py --msa large_sample.afa --tree large_sample.nwk --k 80 --n 80 --opl
    python HYPNO.py --msa small_sample.afa --tree small_sample.nwk --k 70 --n 80 --opl

For a description of how to interpret which subtrees have been refined by HYPNO, please
refer to the HYPNO FAQ at http://phylogenomics.berkeley.edu/HYPNO/help/.
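
For batch runs, the example commands above can also be assembled in a small
script. This is only a sketch of one way to wrap the documented options; the
function name build_hypno_command and its keyword arguments are hypothetical
and are not part of HYPNO.

```python
def build_hypno_command(msa, tree, k=None, n=None, s=None,
                        opl=False, oplnuc=False,
                        python="python", script="HYPNO.py"):
    """Assemble the argument list for one HYPNO run from the documented options."""
    cmd = [python, script, "--msa", msa, "--tree", tree]
    # Numeric thresholds are only passed when set, so HYPNO's defaults apply otherwise.
    for flag, value in (("--k", k), ("--n", n), ("--s", s)):
        if value is not None:
            cmd += [flag, str(value)]
    if opl:
        cmd.append("--opl")
    if oplnuc:
        cmd.append("--oplnuc")
    return cmd


if __name__ == "__main__":
    # Mirrors the suggested command for the small sample dataset.
    print(" ".join(build_hypno_command(
        "small_sample.afa", "small_sample.nwk", k=70, n=80, opl=True)))
```

The returned list can be handed to subprocess.check_call to launch the run.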

OUTPUTS
=======

For an input MSA file 'foo.msa', the program outputs:

    foo.hypno.msa       This file is the nucleotide MSA in Aligned FASTA format.
    foo.hypno.tree      This file contains the re-estimated tree topology in 
                        Newick format.

In addition, a timestamped folder containing intermediate and debug files is created.
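
The naming pattern can be sketched as a small helper that predicts the output
file names for a given input MSA. It assumes that HYPNO replaces the final
extension of the input name (as in the 'foo.msa' example) and writes next to
the input file; neither assumption has been checked against the HYPNO source.
The 'foo.opl.hypno.tree' name is the one documented for the --opl and --oplnuc
options. The function name expected_outputs is hypothetical.

```python
import os


def expected_outputs(msa_path):
    """Predict HYPNO output names for an input MSA, following the 'foo.msa' pattern."""
    # Assumption: the last extension is dropped before the suffixes are added.
    base, _ext = os.path.splitext(msa_path)
    return {
        "nucleotide_msa": base + ".hypno.msa",
        "topology_tree": base + ".hypno.tree",
        "optimized_tree": base + ".opl.hypno.tree",  # only with --opl / --oplnuc
    }


if __name__ == "__main__":
    for kind, name in sorted(expected_outputs("foo.msa").items()):
        print(kind + ": " + name)
```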

EXTENDED USAGE
==============

The script is invoked from the command line. The user must specify the path 
to the multiple sequence alignment file (MSA-file), in FASTA or UCSC a2m 
format, along with the original protein tree (tree-file) in Newick format. 
Both files must contain a valid UniProt accession for each sequence/leaf, and 
the accessions must be consistent between the two files. Additional options may 
be specified to refine subtree definition and nucleotide and tree acceptance criteria.
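
The consistency requirement can be checked before running HYPNO with a sketch
like the one below. It extracts accession-like tokens using the accession
pattern published in the UniProt help pages and compares the two sets. The
function names are hypothetical, the Newick handling is a simple regular
expression rather than a full parser, and HYPNO's own header parsing may
differ.

```python
import re

# UniProt accession pattern, wrapped so it does not match inside longer tokens.
UNIPROT_ACC = re.compile(
    r"(?<![A-Za-z0-9])"
    r"(?:[OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2})"
    r"(?![A-Za-z0-9])"
)


def fasta_accessions(fasta_text):
    """Collect the first accession-like token from each FASTA header line."""
    found = set()
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            match = UNIPROT_ACC.search(line)
            if match:
                found.add(match.group(0))
    return found


def newick_accessions(newick_text):
    """Collect accession-like tokens from the leaf labels of a Newick string."""
    found = set()
    # A leaf label follows "(" or "," and runs until ":", ",", "(", ")" or ";".
    for label in re.findall(r"[(,]\s*([^():,;]+)", newick_text):
        match = UNIPROT_ACC.search(label)
        if match:
            found.add(match.group(0))
    return found


def accession_mismatches(fasta_text, newick_text):
    """Return (only_in_msa, only_in_tree) as two sets of accessions."""
    msa, tree = fasta_accessions(fasta_text), newick_accessions(newick_text)
    return msa - tree, tree - msa
```

Two empty sets indicate that every accession found in the MSA also appears in
the tree and vice versa.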

$ python HYPNO.py --msa <msafile> --tree <treefile> [options...]

Options:
    --msa <msafile>     Input Multiple Sequence Alignment: the user must provide 
                        this argument followed by the path to an alignment of amino 
                        acid sequences, with UniProt accessions included in the 
                        sequence headers. HYPNO will retrieve the corresponding 
                        nucleic acid sequences and use the provided alignment as a 
                        template for constructing a nucleotide MSA.

    --tree <treefile>   Tree re-estimation: specifying this argument along with the 
                        path of the input tree results in re-estimation of tree 
                        topology based on HYPNO retrieved nucleotide sequences. 
                        If this argument is not provided, HYPNO builds and outputs 
                        the nucleotide MSA without generating a tree.

    --k <PPWID>         Subtree selection: Minimum pairwise percent identity (PPWID) for 
                        determining subtree grouping (default: 90.0). Adjusting 
                        this parameter may alter the size of subtrees for which 
                        topology is reevaluated. Very large values may result in
                        no subtrees and no change in tree topology while very
                        small values may result in overly large subtrees and 
                        coarse taxa grouping. To override the default value of 
                        90 percent, type “--k <X>”, where X is a real number 
                        between 0 and 100. For example, to set the subtree selection 
                        value to 93, one should invoke “--k 93.0”.

    --n <PPPID>         PercentID match: Minimum "predicted protein" percent identity
                        (PPPID) between retrieved nucleotide sequence and expected 
                        sequence for retrieved sequence to be accepted 
                        (default: 95.0). This parameter determines acceptance 
                        criteria for retrieved nucleotide sequences. Retrieved 
                        nucleotide sequences are translated to protein using 
                        appropriate codon translation tables. If the percent 
                        identity between the translated retrieved sequence and 
                        the original protein sequence falls below the threshold, 
                        the nucleotide sequence is not considered representative 
                        of the protein sequence and is discarded from the 
                        alignment and subsequently the tree. To override the 
                        default value of 95 percent, type “--n <X>”, where X 
                        is a real number between 0 and 100. For example, to 
                        set the percentID match value to 80, one should invoke 
                        “--n 80.0”.

    --s <PCS>           Retrieval minimum: Minimum percent of correct sequences 
                        (PCS) required in an acceptable tree (default: 100.0). 
                        This value depends on --n <PPPID>. Any value less than 
                        100.0 may result in an output tree with fewer taxa than 
                        the input tree. This parameter determines what fraction 
                        of sequences can be discarded from the alignment, and 
                        subsequently from the tree, because a nucleotide sequence 
                        is unavailable or because identity between the protein 
                        sequence and the retrieved nucleotide sequence is poor. 
                        Setting the PPPID and PCS values very high (overly 
                        stringent) may result in no output tree being produced.

    --opl               Branch length optimization using protein MSA: When specified, 
                        output tree undergoes midpoint rooting and branch lengths are 
                        calculated using the given protein sequences while the tree 
                        topology is kept constant. The resultant tree with rooting and 
                        optimized branch lengths will be located in 'foo.opl.hypno.tree' 
                        (where 'foo' is the input MSA filename). When optimization is not 
                        invoked, the HYPNO output contains only a "topology tree" for 
                        which branch lengths have not been recalculated.

    --oplnuc            Branch length optimization using nucleotide MSA: When specified, 
                        output tree undergoes midpoint rooting and branch lengths are 
                        calculated using the retrieved nucleotide sequences while the tree 
                        topology is kept constant. The resultant tree with rooting and 
                        optimized branch lengths will be located in 'foo.opl.hypno.tree' 
                        (where 'foo' is the input MSA filename). When optimization is not 
                        invoked, the HYPNO output contains only a "topology tree" for 
                        which branch lengths have not been recalculated.
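
To make the interaction between the --n (PPPID) and --s (PCS) thresholds
concrete, the sketch below computes a percent identity between two aligned
sequences, filters sequences by the PPPID threshold, and then checks the PCS
threshold. It is an illustration only: identity is counted here over columns
where neither sequence has a gap, which is one common definition and may not
match the value HYPNO derives from its own alignments, and all function names
are hypothetical.

```python
def percent_identity(seq_a, seq_b, gap_chars="-."):
    """Percent of identical residues over columns where neither sequence has a gap."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    compared = matches = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a in gap_chars or b in gap_chars:
            continue
        compared += 1
        if a == b:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


def accepted_ids(identities, pppid=95.0):
    """Keep sequence IDs whose identity is not below the --n (PPPID) threshold."""
    return [sid for sid, pid in identities.items() if pid >= pppid]


def meets_pcs(n_accepted, n_total, pcs=100.0):
    """True when the accepted share of sequences reaches the --s (PCS) threshold."""
    return n_total > 0 and 100.0 * n_accepted / n_total >= pcs


if __name__ == "__main__":
    # Hypothetical identities between translated nucleotide and original protein.
    identities = {"P12345": 98.0, "Q9XYZ1": 82.5}
    kept = accepted_ids(identities, pppid=95.0)
    print(kept, meets_pcs(len(kept), len(identities), pcs=100.0))
```

With the default --s of 100.0, a single rejected sequence is enough for the
check to fail, which is why lowering --n or --s may be needed for families
with incomplete nucleotide coverage.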