
Phylogeographic workflow using sliding windows, RAxML-NG, and FastTree

Primary language: Python. License: MIT.



🐍 Snakemake workflow: aPhyloGeo


       A Snakemake workflow for phylogeographic analysis.

Table of Contents
  1. About the project
  2. Dependencies
  3. Getting started
  4. Citation
  5. Contact

📝 About the project

    aPhyloGeo-pipeline is a user-friendly, scalable, reproducible, and comprehensive workflow for exploring how patterns of variation within species coincide with geographic features, such as climatic features. Using user-defined parameters such as the fragment size (window size) and the sliding-window advancement step (step size), the pipeline scans a multiple sequence alignment (MSA) window by window and analyzes it jointly with environmental data to identify gene fragments that are strongly associated with specific environmental factors.
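The role of the window size and step size can be illustrated with a minimal Python sketch. It is not the pipeline's own code: the function name `sliding_windows`, the toy sequences, and the choice to skip a trailing window shorter than the window size are all assumptions made for illustration.

```python
def sliding_windows(alignment, window_size, step_size):
    """Yield (start, end, sub_alignment) for each full-width window.

    alignment maps a sequence name to its aligned sequence; all aligned
    sequences must have the same length.
    """
    lengths = {len(seq) for seq in alignment.values()}
    if len(lengths) != 1:
        raise ValueError("all sequences in an MSA must have the same length")
    (length,) = lengths
    # Assumption: a trailing window shorter than window_size is skipped.
    for start in range(0, length - window_size + 1, step_size):
        end = start + window_size
        yield start, end, {name: seq[start:end] for name, seq in alignment.items()}


# Toy alignment (made-up sequences, 10 columns each).
msa = {
    "sp1": "ATGCATGCAT",
    "sp2": "ATGAATGCTT",
    "sp3": "ATGCTTGCAT",
}

windows = list(sliding_windows(msa, window_size=4, step_size=3))
for start, end, sub_alignment in windows:
    print(start, end, sub_alignment["sp1"])
```

With a window size of 4 and a step size of 3, the 10-column toy alignment yields windows covering columns 0-4, 3-7, and 6-10 (half-open intervals), so consecutive windows overlap by one column. A smaller step size gives a finer scan at the cost of more trees to build.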

    To investigate the potential correlation between the diversity of specific genes or gene fragments and their geographic distribution, a sliding window strategy was employed in addition to traditional phylogenetic analyses. First, the multiple sequence alignment (MSA) was partitioned into windows according to the specified window size and step size, and a phylogenetic tree was constructed for each window. Second, a cluster analysis was performed for each geographic factor by calculating a distance matrix and building a reference tree from that matrix with the Neighbor-Joining clustering method (Cardoso et al., 2022). Reference trees (based on geographic factors) and phylogenetic trees (based on sliding windows) were defined on the same set of leaves (i.e., names of species). Subsequently, the correlation between phylogenetic and reference trees was evaluated using the Robinson-Foulds (RF) distance, which was calculated for each combination of a phylogenetic tree and a reference tree. Finally, bootstrap and RF thresholds were applied to identify gene fragments in which patterns of variation within species coincided with a particular geographic feature. These fragments can serve as informative reference points for future studies.
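The RF comparison step can be sketched in plain Python as follows. This is an illustrative sketch, not the pipeline's implementation: trees are written as nested tuples of leaf names, the names `leaf_set`, `splits`, and `rf_distance` are hypothetical, and the trees and the threshold value are made up. The bootstrap filter is omitted.

```python
def leaf_set(tree):
    """Return the set of leaf names of a tree given as nested tuples."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))


def splits(tree):
    """Return the non-trivial bipartitions of the tree, viewed as unrooted.

    Each bipartition is stored as the side that does not contain a fixed
    anchor leaf, so the same split always has the same representation.
    """
    all_leaves = leaf_set(tree)
    anchor = min(all_leaves)
    found = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        leaves = frozenset().union(*(visit(child) for child in node))
        side = all_leaves - leaves if anchor in leaves else leaves
        # Keep only splits with at least two leaves on each side.
        if 1 < len(side) < len(all_leaves) - 1:
            found.add(side)
        return leaves

    visit(tree)
    return found


def rf_distance(tree_a, tree_b):
    """Robinson-Foulds distance: number of splits found in exactly one tree."""
    if leaf_set(tree_a) != leaf_set(tree_b):
        raise ValueError("both trees must be defined on the same set of leaves")
    return len(splits(tree_a) ^ splits(tree_b))


# Made-up reference tree (from a geographic factor) and per-window trees.
reference = (("A", "B"), (("C", "D"), "E"))
window_trees = {
    "window_1": (("A", "B"), (("C", "D"), "E")),
    "window_2": (("A", "B"), (("C", "E"), "D")),
    "window_3": (("A", "C"), (("B", "D"), "E")),
}

rf_threshold = 2  # hypothetical threshold, for illustration only
kept = [
    name
    for name, tree in window_trees.items()
    if rf_distance(tree, reference) <= rf_threshold
]
print(kept)
```

In this toy example the three windows have RF distances of 0, 2, and 4 to the reference tree, so only the first two pass the threshold. The check on the leaf sets reflects the requirement that reference and phylogenetic trees share the same species names.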

⚒️ Dependencies

The workflow includes the following Python packages:

The workflow includes the following bioinformatics tools:

The software dependencies can be found in the conda environment file.

Getting started

1. Clone this repo.

git clone https://github.com/tahiri-lab/aPhyloGeo-pipeline.git
cd aPhyloGeo-pipeline

2. 🚀 Install dependencies.

2.1 If you do not have Conda installed, then use the following method to install it. If you already have Conda installed, then refer directly to the next step (2.2).

# download Miniconda3 installer
wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh -O miniconda.sh

# install Conda (respond by 'yes')
bash miniconda.sh

# update Conda
conda update -y conda

2.2 Create a conda environment named aPhyloGeo and install all the dependencies in that environment.

# create a new environment with dependencies 
conda env create -n aPhyloGeo -f environment.yaml

2.3 Activate the environment

conda activate aPhyloGeo

3. Configure the workflow

  • Prepare the config file:

    Modify the parameter values and thresholds in config.yaml according to your research needs.

    Note: Please modify the corresponding values and do NOT change the parameter names or file names.

  • Prepare the input files:

    • A multiple sequence alignment file in FASTA format
    • A CSV file containing environmental data on the geographic habitat of the studied species
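Because the phylogenetic trees and the reference trees must share the same species names, it can help to check that the two input files agree before running the workflow. The following is a minimal sketch under stated assumptions: the CSV column name `id`, the `temperature` column, and all data values are hypothetical, and the real layout expected by the pipeline may differ.

```python
import csv
import io


def fasta_names(handle):
    """Return the first word of every non-empty FASTA header line."""
    return [
        line[1:].split()[0]
        for line in handle
        if line.startswith(">") and line[1:].strip()
    ]


def csv_names(handle, id_column):
    """Return the values of the identifier column of a CSV file."""
    return [row[id_column] for row in csv.DictReader(handle)]


def check_inputs(fasta_handle, csv_handle, id_column):
    """Return (names only in the FASTA file, names only in the CSV file)."""
    in_fasta = set(fasta_names(fasta_handle))
    in_csv = set(csv_names(csv_handle, id_column))
    return sorted(in_fasta - in_csv), sorted(in_csv - in_fasta)


# Made-up in-memory inputs; real use would pass open file handles instead.
fasta_text = ">sp1\nATGC\n>sp2\nATGA\n>sp3\nATGT\n"
csv_text = "id,temperature\nsp1,12.5\nsp2,14.0\nsp4,9.8\n"

only_fasta, only_csv = check_inputs(
    io.StringIO(fasta_text), io.StringIO(csv_text), id_column="id"
)
print(only_fasta, only_csv)
```

Here `sp3` appears only in the alignment and `sp4` only in the environmental data, so both would need to be reconciled before the trees can be compared on a common set of leaves.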

4. Execute the workflow.

Run workflow

# In a conda environment where all dependencies are already installed
# Specify the maximum number of CPU cores to be used at the same time.
# To use N cores: --cores N or -cN.

snakemake --cores all

The workflow can also be run without creating and activating the conda environment described in steps 2.2 and 2.3: with '--use-conda', Snakemake creates the required conda environment itself.

# To specify the maximum number of CPU cores to be used at the same time. 
# 	With N cores: --cores N or -cN. 
# 	For all cores in the system: --cores all. 

snakemake --use-conda --cores all

✔️ Citation

1️⃣ A manuscript for aPhyloGeo-pipeline is in preparation.

2️⃣ The usage of this workflow is described in the Snakemake Workflow Catalog. If you use this workflow in a paper, don't forget to give credit to the authors by citing the URL of this (original) repository and its DOI (see above).

📧 Contact

Please email us at: Nadia.Tahiri@USherbrooke.ca for any questions or feedback.