/PHYling_unified

Unified PHYling pipeline for species tree building from annotated genomes (see https://github.com/stajichlab/AAFTF and https://github.com/nextgenusfs/funannotate for assembly and annotation steps)

Primary LanguagePythonMIT LicenseMIT

Conda Python

PHYling tool

The unified PHYling pipeline for phylogenomic data collection from annotated genomes.

This is latest iteration of tool for using phylogenetically conserved markers to pull out informative gene or protein info from genomic and transcriptomic datasets in order to construct gene trees and species phylogenies.

The aligned markers can be extracted from protein sequences for phylogenetic analyses and also projected into coding sequence alignments for codon-based analyses for better resolution of recently diverged species.

The assumptions in this approach are that the markers are generally single copy in genomes and taking best hit is sufficient first approximation for identifying orthologs. A separate file is parsed and file best_multihits which lists all the hits above the cutoff threshold for a given marker which can be used to assess duplication or attempt to incorporate paralogs into the analysis down the road.

The marker sets developed for this approach in fungi are available as part of the 1KFG Phylogenomics_HMMs project resource and preferred use of the BUSCO marker sets.

Flow chart

PHYling flowchart

New features compared to the original version

  • Using pyhmmer to improve the multithread performance in hmmsearch and hmmalign.
  • Implement all stuff in python. The entire program will be more readable and maintainable.
  • Simplify some steps and reduce the intermediate files as much as possible.
  • Muscle is now available for alternative alignment method.

Usage

First of all, install the package following the instruction below.

PHYling is a package to extract phylogenomic markers and build a phylogenetic tree upon them. It comprises 3 modules - download, align and tree. Use phyling --help to see more details.

positional arguments:
  {download,align,tree}
    download            Download HMM markers
    align               Run multiple sequence alignments against orthologs
                        found among samples
    tree                Build a phylogenetic tree based on multiple sequence
                        alignment results

optional arguments:
  -h, --help            show this help message and exit
  -V, --version         show program's version number and exit

To test run on the example files, please cd into the folder example.

cd example

The folder example/pep includes 4 example peptide fasta - Afum.aa.fasta, Rant.aa.fasta, Scer.aa.fasta and Zymps1.aa.fasta.

Download HMM markerset

The download module is used to download the HMM markerset from BUSCO website. (Currently is updated to v5) See all options with phyling download --help.

positional arguments:
  HMM markerset or "list"
                        Name of the HMM markerset

optional arguments:
  -h, --help            show this help message and exit
  -v, --verbose         Verbose mode for debug
  -o OUTPUT, --output OUTPUT
                        Output directory to save HMM markerset
                        (default="./HMM")

Firstly, use download list to show the available BUSCO markersets.

phyling download list

Copy and paste the markerset to download it. Here we use fungi_odb10 as example.

phyling download fungi_odb10

Find the orthologs and align them

The align module identify the orthologs among all the samples using hmmsearch. HMM profiles that have matches on more than 3 samples are considered orthologs.

Before conducting hmmsearch, the module will first search for the bitscore cutoff file within the root HMM folder. If the cutoff file is not found, the reporting threshold for hmmsearch will be determined based on the -E/-evalue (default is 1e-10).

Once the orthologs are identified, the sequences extracted from each sample undergo multiple sequence alignment. By default, the alignment is performed using the hmmalign method. However, users have the option to switch to using muscle by specifying the -M/--method muscle flag.

By default, each alignment result is output separately and is expected to resolve their phylogeny by consensus tree method. If you prefer to use concatenate strategy. You can concatenate all the alignment by passing -c/--concat. See all the options with phyling align --help.

options:
  -h, --help            show this help message and exit
  -v, --verbose         Verbose mode for debug
  -i INPUTS [INPUTS ...], --inputs INPUTS [INPUTS ...]
                        Query pepetide fasta
  -I INPUT_DIR, --input_dir INPUT_DIR
                        Directory containing query pepetide fasta
  -o OUTPUT, --output OUTPUT
                        Output directory of the alignment results (default="./align")
  -m MARKERSET, --markerset MARKERSET
                        Directory of the HMM markerset
  -E EVALUE, --evalue EVALUE
                        Hmmsearch reporting threshold when score cutoff file is not found (default=1e-10)
  -M {hmmalign,muscle}, --method {hmmalign,muscle}
                        Program used for multiple sequence alignment (default="hmmalign")
  -n, --non_trim        Report non-clipkit-trimmed alignment results
  -c, --concat          Report concatenated alignment results
  -t THREADS, --threads THREADS
                        Threads for hmmsearch and the number of parallelized jobs in MSA step (default=1)

Run the align module with all the fasta files under folder pep.

phyling align -I pep -m HMM/fungi_odb10/hmms

An equivalent way to send inputs.

phyling align -i pep/*.fasta -m HMM/fungi_odb10/hmms

Or if you're just interested in part of the fasta, you can specify the inputs one-by-one.

phyling align -i pep/Afum.aa.fasta pep/Rant.aa.fasta pep/Scer.aa.fasta -m HMM/fungi_odb10/hmms

Note: Required at least 3 samples in order to build a tree!

Accelerate by using 16 cpus. According to pyhmmer benchmark the multithread acceleration drop dramatically when more cpu is used. When less then 8 cpus are given, the hmmsearch step will run in single-thread manner and all cpus will be used for each round of hmmsearch. When 8 or more cpus are given, the hmmsearch step will use 4 cpus for each parallel job. In this example, 4 hmmsearch jobs will run parallelly and each job utilize 4 cpus. For the alignment step, 16 parallel jobs will be launched and each parallel job is running in single-thread manner.

Highly recommended if muscle is chosen for alignment. (muscle is much slower than hmmalign!!)

phyling align -I pep -m HMM/fungi_odb10/hmms -t 16

Build tree on multiple sequence alignment results

Finally, we can run the tree module to use the multiple sequence alignment results to build a phylogenetic tree. It support both consensus tree (conclude the majority of trees which was built upon each single gene) and concatenated alignment method. See all the options with phyling tree --help.

optional arguments:
  -h, --help            show this help message and exit
  -v, --verbose         Verbose mode for debug
  -i INPUTS [INPUTS ...], --inputs INPUTS [INPUTS ...]
                        Multiple sequence alignment fasta
  -I INPUT_DIR, --input_dir INPUT_DIR
                        Directory containing multiple sequence alignment fasta
  -o OUTPUT, --output OUTPUT
                        Output directory of the newick treefile (default=".")
  -m {upgma,nj}, --method {upgma,nj}
                        Algorithm used for tree building (default="upgma")
  -f, --figure          Generate a matplotlib tree figure

Run the tree module with all the alignment results under folder align.

phyling tree -I align

You can also use only part of the alignment results to build tree.

phyling tree -i align/100957at4751.faa align/174653at4751.faa align/255412at4751.faa

Use Neighbor Joining algorithm instead of the default UPGMA method for tree building.

phyling tree -I align -m nj

Use matplotlib to generate a tree figure.

phyling tree -I align -f

Requirements and Installation

  • Python >= 3.7
  • Biopython
  • pyhmmer, a HMMER3 implementation on python3.
  • clipkit for trimming.
  • muscle for alternative method for multiple sequence alignment. (Optional)

Use the environment.yml to install all the required packages

conda env create -f environment.yml

Go into the phyling folder and install the package through pip

pip install .

Notes

  • Training your own marker set is also possible but most busco sets are good starting place.
  • The multiple sequence alignment results can also be sent to other phylogenetic tool like IQ-tree for tree building.