phylen

Phylen is an R package that performs automatic phylogenetic reconstruction given a set of Hidden Markov Models (HMMs). Genomes are screened against these HMMs, genes found in all genomes ("core genes") are aligned individually, those alignments are concatenated into a single supergene alignment, and a phylogenetic reconstruction is performed and returned as an object of class "phylo" so it can be further analysed using ape/phangorn framework in R. Functions to download well curated HMMs from clade-specific orthologous sets from the EggNOG database are provided although any custom set of HMMs can be used as well.

Installation

The easiest way to install this package is using devtools package:

if(!require(devtools)){
   install.packages("devtools")
}
devtools::install_github("iferres/phylen")

Requirements

phylen depends on HMMER 3.1b2 and MAFFT. You should have them installed on you $PATH variable prior to using this software. Besure that the following MAFFT aliases are also installed: linsi (alias for mafft --maxiterate 1000 --localpair), ginsi (alias for mafft --maxiterate 1000 --globalpair) and einsi (alias for mafft --ep 0 --maxiterate 1000 --genafpair). This shortcuts are installed together with mafft if you download and install the software from the MAFFT webpage (link above), but probably not if use a package manager as apt or brew.

It also depends on phangorn package, which in turn depends on igraph. Some system requirements are needed to install the latter, please check them here.

Standard workflow

This tutorial begins with the extraction of toy data attached on this package. It consists in 10 genomes of the Campylobacterales order, 5 of them are Campylobacter species and the other 5 are Helicobacter species.

Note: phylen input files must be in gff3 format, as returned by the prokka annotation software.

# List the attached tar.gz file
tgz <- system.file('extdata', 'toydata.tar.gz', package = 'phylen')
# List the files inside it
gff <- untar(tarfile = tgz, exdir = getwd(), list = T)
# Decompress on current working directory
untar(tarfile = tgz,exdir = getwd())
# List the decompressed files
gff

##  [1] "C_fetus_subsp_testudinum_Sp3.gff"                   
##  [2] "C_fetus_subsp_venerealis_str_84-112.gff"            
##  [3] "C_hyointestinalis_subsp_lawsonii_CCUG_27631.gff"    
##  [4] "C_iguaniorum_str_RM11343.gff"                       
##  [5] "C_pinnipediorum_subsp_pinnipediorum_str_RM17262.gff"
##  [6] "H_bilis_str_AAQJH.gff"                              
##  [7] "H_himalayensis_str_YS1.gff"                         
##  [8] "H_pylori_J166.gff"                                  
##  [9] "H_pylori_str_2018.gff"                              
## [10] "H_typhlonius_str_1.gff"

Let's load the phylen package and list the available Hidden Markov Models (HMMs) on the EggNOG database.

# Load the phylen package
library(phylen)

## Loading required package: phangorn

## Loading required package: ape

# List first 50 available sets of EggNOG
list_eggnogdb()[1:50, ]

##                level.name nog.prefix
## 1                    LUCA        NOG
## 2           Acidobacteria     aciNOG
## 3          Acidobacteriia    acidNOG
## 4            Aconoidasida     acoNOG
## 5          Actinobacteria     actNOG
## 6              Agaricales     agaNOG
## 7          Agaricomycetes    agarNOG
## 8             Apicomplexa     apiNOG
## 9    Proteobacteria_alpha    aproNOG
## 10              Aquificae     aquNOG
## 11                Archaea      arNOG
## 12           Archaeoglobi     arcNOG
## 13             Arthropoda     artNOG
## 14      Arthrodermataceae    arthNOG
## 15             Ascomycota     ascNOG
## 16                   Aves     aveNOG
## 17                Bacilli     bacNOG
## 18               Bacteria    bactNOG
## 19            Bacteroidia   bacteNOG
## 20          Basidiomycota     basNOG
## 21          Bacteroidetes    bctoNOG
## 22              Bilateria      biNOG
## 23    Proteobacteria_beta    bproNOG
## 24            Brassicales     braNOG
## 25              Carnivora     carNOG
## 26          Chaetomiaceae     chaNOG
## 27               Chlorobi     chlNOG
## 28             Chlamydiae    chlaNOG
## 29            Chloroflexi    chloNOG
## 30            Chloroflexi   chlorNOG
## 31            Chlorophyta  chloroNOG
## 32               Chordata    chorNOG
## 33            Chromadorea     chrNOG
## 34             Clostridia     cloNOG
## 35               Coccidia     cocNOG
## 36          Crenarchaeota     creNOG
## 37      Cryptosporidiidae     cryNOG
## 38          Cyanobacteria     cyaNOG
## 39             Cytophagia     cytNOG
## 40      Debaryomycetaceae     debNOG
## 41        Deferribacteres     defNOG
## 42      Dehalococcoidetes     dehNOG
## 43     Deinococcusthermus     deiNOG
## 44          delta/epsilon     delNOG
## 45                Diptera     dipNOG
## 46        Dothideomycetes     dotNOG
## 47   Proteobacteria_delta    dproNOG
## 48          Drosophilidae     droNOG
## 49 Proteobacteria_epsilon    eproNOG
## 50        Erysipelotrichi     eryNOG

The order Campylobacterales belongs to the class Epsilonproteobacteria, which is listed at row number 49. The corresponding set of HMMs is, then, eproNOG.

We will download the HMMs directly from the EggNOG servers using the download_nog_hmm() function.

hmm <- download_nog_hmm(nog.prefix = "eproNOG", onDir = getwd())
hmm

## [1] "/home/iferres/Documents/eproNOG.hmm.tar.gz"

Now we have the HMMs, so we can proceed to run the main function in order to obtain a core genome alignment, and a phylogeny. This would take ~20-30 min using 4 CPUs.

p <- phylen(gffs = gff, # The gff files extracted on the first step
            hmmFile = hmm, # The downloaded HMMs
            isCompressed = TRUE, # The downloaded HMMs are compressed (tar.gz)
            phyloMode = "ml", # nj or ml, in this case we will use maximum likelihood
            nbs = 100, # The number of bootstrap.
            level = 100, # The percentage of genomes a gene must be in to be considered as part of the coregenome.
            outDir = "phylen", # Lets create a directory called "phylen" to put the output files
            n_threads = 4) # Use 4 threads

## Decompressing.. DONE!
## Concatenating.. 
## ===========================================================================
## DONE!
## Pressing.. DONE!
## Getting information from hmms.. DONE!
## Searching.. DONE!
## Computing panmatrix.. DONE!
## Getting core-genes.. DONE!
## Writting fastas.. DONE!
## Aligning.. DONE!
## Concatenating.. DONE!
## Removing intermediate files.. DONE!
## Generating phylogeny..
## optimize edge weights:  -5958637 --> -5731557 
## optimize base frequencies:  -5731557 --> -5686007 
## optimize rate matrix:  -5686007 --> -5621016 
## optimize edge weights:  -5621016 --> -5619537 
## optimize base frequencies:  -5619537 --> -5616203 
## optimize rate matrix:  -5616203 --> -5615171 
## optimize edge weights:  -5615171 --> -5615158 
## optimize base frequencies:  -5615158 --> -5614921 
## optimize rate matrix:  -5614921 --> -5614863 
## optimize edge weights:  -5614863 --> -5614861 
## optimize base frequencies:  -5614861 --> -5614846 
## optimize rate matrix:  -5614846 --> -5614842 
## optimize edge weights:  -5614842 --> -5614842 
## optimize base frequencies:  -5614842 --> -5614841 
## optimize rate matrix:  -5614841 --> -5614841 
## optimize edge weights:  -5614841 --> -5614841 
## optimize base frequencies:  -5614841 --> -5614841 
## optimize rate matrix:  -5614841 --> -5614841 
## optimize edge weights:  -5614841 --> -5614841

## Generating phylogeny.. DONE!
## 
## Finished: 658 groups of orthologous from 10 isolates have been used in the alignment.
## Returning an object of class "phylo" with 10 tips and 8 nodes.

Finally we have a "phylo" object, and we can continue analizing it with phangorn and ape packages.

# Print it
p

## 
## Phylogenetic tree with 10 tips and 8 internal nodes.
## 
## Tip labels:
##  C_fetus_subsp_testudinum_Sp3, C_fetus_subsp_venerealis_str_84-112, C_hyointestinalis_subsp_lawsonii_CCUG_27631, C_iguaniorum_str_RM11343, C_pinnipediorum_subsp_pinnipediorum_str_RM17262, H_bilis_str_AAQJH, ...
## Node labels:
##  100, 100, 100, 100, 100, 100, ...
## 
## Unrooted; includes branch lengths.

# Class?
class(p)

## [1] "phylo"

# Plot it
plot(p, type = 'unrooted', cex = 0.7, lab4ut = 'axial')

Any questions?

If you have any question about usage, implementation, or result interpretation, please feel free to ask. Send them to the issue tracker or directly to my e-mail (iferres@pasteur.edu.uy). If you think your concern could help others, we encourage you to use the first channel of communication.

Contributing

If you want to contribute to the development of this software, please refer to the contributing guidelines.

Citation

Ferrés et al., (2018). Phylen: automatic phylogenetic reconstruction using the EggNOG database. Journal of Open Source Software, 3(25), 593, DOI: 10.21105/joss.00593