phySCO

Phylogenomics from Single Copy Orthologs

Welcome to phySCO!

phySCO is a Python script that infers a maximum likelihood phylogenomic tree from BUSCO single-copy orthologous genes. phySCO retrieves these genes from already-available BUSCO results (that is, phySCO does not run BUSCO itself).

Current version: 1.1.0

Many thanks to NiccoloRighetti, whose work has pushed me to start writing this code.

General description

Given a directory containing BUSCO results for a group of species, phySCO computes the maximum likelihood phylogenetic tree of all the species.

To run phySCO, create a dedicated directory and copy into it the BUSCO results for the species you want to analyse.

Make sure to keep the default directory structure of the BUSCO results.

phySCO can process both amino acid and nucleotide sequences of complete single-copy BUSCO genes. By default it works with amino acid sequences, because every BUSCO run mode (genome, transcriptome or protein) returns them.
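To illustrate why the default BUSCO structure matters, here is a minimal sketch of how single-copy sequences could be located and intersected across species. It is not phySCO's own code. It assumes the BUSCO v5 layout `<species>/run_<lineage>/busco_sequences/single_copy_busco_sequences/`, with one `.faa` file per gene named after its BUSCO ID; the function names `collect_single_copy` and `shared_busco_ids` are hypothetical.

```python
from pathlib import Path


def collect_single_copy(busco_root, ext=".faa"):
    """Map each species folder to {busco_id: sequence_file}.

    Assumption: every species has its own top-level folder under
    busco_root, and BUSCO wrote its results in the default v5 layout.
    """
    root = Path(busco_root)
    per_species = {}
    for seq_dir in sorted(root.rglob("single_copy_busco_sequences")):
        if not seq_dir.is_dir():
            continue
        # The first path component below the root is taken as the species ID.
        species = seq_dir.relative_to(root).parts[0]
        genes = per_species.setdefault(species, {})
        for fasta in sorted(seq_dir.glob(f"*{ext}")):
            genes[fasta.stem] = fasta  # file stem is assumed to be the BUSCO ID
    return per_species


def shared_busco_ids(per_species):
    """Return the BUSCO IDs found as single-copy in every species."""
    id_sets = [set(genes) for genes in per_species.values()]
    return set.intersection(*id_sets) if id_sets else set()
```

Passing `ext=".fna"` would pick up nucleotide sequences instead, assuming the BUSCO run produced them. If the folders are renamed or flattened, the search above finds nothing, which is why the default structure should be left untouched.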

Required software and dependencies

Here is the list of software that phySCO requires:

  • python (v3.11)
  • mafft (v7.505)
  • trimal (v1.4.rev15)
  • IQTREE2 (2.1.4-beta COVID-edition)

You can install all of them directly from the phySCO_env.yml YAML file, through the command:

conda env create -f phySCO_env.yml
conda activate phySCO_env
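After activating the environment, a quick check that the external tools are reachable can save a failed run. The sketch below is not part of phySCO; the executable names `mafft`, `trimal` and `iqtree2` are assumptions based on the usual conda packages and may differ on your system.

```python
import shutil

# Assumed executable names; adjust them if your installation differs.
REQUIRED_TOOLS = ["mafft", "trimal", "iqtree2"]


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the tools from the list that are not found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All required tools were found.")
```

An empty result only means the executables exist on PATH; it does not verify their versions.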

Example dataset

example_dataset/ is a test dataset that can be used directly as input to phySCO. It was generated by running BUSCO on a random set of mammal NCBI reference genomes.

example_dataset_key.md contains the metadata of the example dataset, including the key that maps each species identifier to its species.

Before running phySCO on the example dataset, run the following commands:

mkdir example_dataset_extracted
for i in example_dataset/*tar.gz; do tar -xvzf "$i" -C example_dataset_extracted; done
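Where a POSIX shell is not available, the same extraction can be sketched in Python with the standard `tarfile` module. The helper name `extract_archives` is hypothetical and not part of phySCO.

```python
import tarfile
from pathlib import Path


def extract_archives(src_dir, dest_dir):
    """Extract every *tar.gz archive in src_dir into dest_dir."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    extracted = []
    for archive in sorted(Path(src_dir).glob("*tar.gz")):
        # Only extract archives you trust, such as the bundled example dataset.
        with tarfile.open(archive, "r:gz") as tar:
            tar.extractall(dest)
        extracted.append(archive.name)
    return extracted


if __name__ == "__main__":
    extract_archives("example_dataset", "example_dataset_extracted")
```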

Then run phySCO on the extracted directory:

python3 phySCO.py -i example_dataset_extracted/