busco2phylo-nf

Nextflow pipeline for turning a set of BUSCO results into a species tree using two approaches - concatenation and gene trees.

Dependencies

conda
nextflow

Different steps of the pipeline use mafft, iqtree and trimal.

In the example config file the environment name is orthology_env. You can create a working a environment with conda create -n orthology_env -y -c bioconda mafft iqtree trimal.

Executing the pipeline on the Sanger farm

Clone this repository:

git clone https://github.com/lstevens17/busco2phylo-nf.git

Submit an interactive job. Choose a queue appropriate to the time of execution of your pipeline. Take into account that the default config submits jobs to the 'normal' queue of 12 hours execution and if the job fails it will be resubmitted into the 'long' queue.

mbMem=5000; bsub -n 1 -q long -R"span[hosts=1] select[mem>${mbMem}] rusage[mem=${mbMem}]" -M${mbMem} -Is bash

Launch the pipeline:

nextflow -C busco2phylo-nf/busco2phylo.config run busco2phylo-nf/busco2phylo.nf --busco_dir_path [absolute_path_to_BUSCO_directory] -profile farm

Note: the busco_dir_path should contain a set of BUSCO results with sensible names. The directory name will be used to create the sequence headers in the FASTAs and therefore become the tip labels in the resulting trees.

Notes on running the pipeline

This pipeline serves as a framework of basic commands that can be executed sequentially. However, these commands should be carefully evaluated before running and altered as appropriate in order to be suitable for (i) a given dataset and (ii) generating a tree suitable to answering your downstream questions of interest.

Busco2fasta.py - this makes one fasta file per BUSCO where each entry in the file is the sequence of the BUSCO in a given species. By default, all species need to have a given BUSCO for a given BUSCO to have a fasta file created. This behaviour can be tuned via the -p flag e.g. try 0.8-100 inclusion. You may then choose a cutoff which strikes a balance between missingness of data (i.e. more species lacking a BUSCO) and loss of data (filtering out more BUSCOs). The other consideration is whether you are working with amino acids (suited for longer evolutionary distances) or nucleotides. Based on this, set the BUSCO path to the amino acid output of BUSCO or nucleotide output.

Reformat_fastas - to strip headers of all information other than species ID.
Align_fastas - by defalt this uses mafft as an aligner with default settings. Consider whether this is appropriate for your analysis.

At this point the methods for a concatenation vs gene tree-based approach differ.

Concatenation approach

Trim_alignments - trimming via trimal is done by default. This potentially removes regions of the alignment that are very poorly aligned. However, you could be removing varying regions carying important phylogenetic signal. Consider whether it is appropriate to trim for your data.
Concatenate_alignments - make one large supermatrix where each species is represented by one line.
Infer_supermatrix_tree - run iqtree on supermatrix. By default, the pipeline specifes a GTR20+G4 model of sequence evolution. Again, consider is this is suitable for you. GTR20 infers 190 rate parameters so both very computationally expensive, slow and could be overparameterise. Could model test (see below) or use prior knowledge of what could be suitable e.g. LG+G4.

Gene tree approach

Infer gene trees - infers a gene tree per BUSCO alignment using iqtree. By default, iqtree will first test different models of sequence evolution before then picking the best model to then infer the gene tree. This can take very long! If you have many species per BUSCO, reccomend either choosing a model (set via -m e.g. -m LG+G4) or you could first run model selection on a subset (or all) BUSCOs, then run iqtree on all BUSCOs using the most common best model. Iqtree supports model selection without tree inference: iqtree -s MSA -m MF, see iqtree docs for all options.
Run_astral - infers species tree from the set of gene trees. As gene trees do not contain substitution information, branch lengths are not inferred.
estimate_astral_BLs - iqtree specifies the tree generated by astral as a fixed topology, then uses the supermatrix from the supermatrix-based approach to infer branch lengths. Again, need to provide a model of sequence evolution, by default its GTR20+G4.