CorEvol is a lightweight Python pipeline to calculate ω (dN/dS) values for the core genome of multiple genomic CDS sequences. The general workflow is described in the flowchart below.
CorEvol processes a folder of your genomic CDS files and outputs the ω (dN/dS) values of the core, non-redundant, non-recombinant sequences, along with other useful files and folders. See Usage for more details.
graph LR
subgraph Input Files
A[seq1.fna, seq2.fna, ..., seqN.fna]
end
A --> B((Clustering<br/>&<br/>Redundancy removal))
B --> C[Core Clusters]
C --> X((Alignment))
X --> Y[Aligned Core Clusters]
Y --> D((Recombination<br/>Detection))
D --> E[Recombinant Seqs.]
D --> F[Non Recombinant Seqs.]
F --> G[ω of core,<br/>non-redundant,<br/>non-recombinant<br/>clusters]
The latest version of CorEvol is available here
- Create a conda environment.
conda create -n corevol_env python=3.8
conda activate corevol_env
- Install OpenRDP in the said environment.
conda install -c bioconda openrdp
- Clone the CorEvol repository.
git clone https://github.com/AbhishakeL/CorEvol.git
cd corevol
- Install required Python packages.
pip install -r requirements.txt
To use CorEvol, run the following command to get to know of all the available parameters and options:
python CorEvol.py -h
Usage options:
python CorEvol.py -h
usage: CorEvol.py [-h] -i DIRECTORY -o OUTPUT [-p PATH] [-c IDENTITY]
[-d LENGTH_DIFF] [-sc LENGTH_CUTOFF] [-aL ALIGN_COV_LONG]
[-aS ALIGN_COV_SHORT] [-g MEMORY] [-n WORD_LENGTH]
[-pC PHYLOGENY_CUTOFF] [-r RDP_CONFIG] [-x COUNTS]
[-t THREADS]
Pipeline for running CorEvol.
options:
-h, --help show this help message and exit
-i DIRECTORY, --directory DIRECTORY
Directory containing the FASTA files
-o OUTPUT, --output OUTPUT
Directory containing the output files
-p PATH, --path PATH Path where the cdhit program is located
-c IDENTITY, --identity IDENTITY
Sequence identity threshold (default: 0.9)
-d LENGTH_DIFF, --length_diff LENGTH_DIFF
Length difference cutoff (default: 0)
-sc LENGTH_CUTOFF, --length_cutoff LENGTH_CUTOFF
Length difference cutoff in amino acid (default: 1)
-aL ALIGN_COV_LONG, --align_cov_long ALIGN_COV_LONG
Alignment coverage for longer sequence (default: 0.9)
-aS ALIGN_COV_SHORT, --align_cov_short ALIGN_COV_SHORT
Alignment coverage for shorter sequence (default: 0.9)
-g MEMORY, --memory MEMORY
Maximum available memory in GB (default: 1)
-n WORD_LENGTH, --word_length WORD_LENGTH
Word length (default: 9)
-pC PHYLOGENY_CUTOFF, --phylogeny_cutoff PHYLOGENY_CUTOFF
Minimum length of nucleotides prior to alignment
(default : 300)
-r RDP_CONFIG, --rdp_config RDP_CONFIG
Path where internal parameters of RDP scanner is saved
-x COUNTS, --counts COUNTS
Number of different RDP testing methodology used to
confidently conclude a sequence to be recombinant
(default: 4, max: 6)
-t THREADS, --threads THREADS
Number of threads (default: 4)
We strongy recommend running CorEvol.py
from SPYDER Ipython terminal. This recommendation is mostly to circumnavigate a bug in the codeml
package.
%run CorEvol.py -i ./Test -o ~/Out
The output folder contains the following files and folders.
Name | Type | Content |
---|---|---|
Neutral_Selection | Directory | A directory containing all the cluster of sequences that have undergone neutral selection event |
NonRecombination_files | Directory | A directory containing _results.csv : Recombination detection output file , _formatted.phy and _tree.nwk files needed to run codeml for each of the clusters where recombination is not detected |
Positive_Selection | Directory | A directory containing all the cluster of sequences that have undergone positive selection event |
Purifying_Selection | Directory | A directory containing all the cluster of sequences that have undergone negative selection event |
Recombination_Clusters | Directory | A directory containing _results.csv : Recombination detection output file for each of the clusters where recombination is detected |
concatenated_seq.aln | File | ClustalW format alignment file for all the core, non-redundant, non-recombinant sequences, joined end to end |
concatenated_seq.fas | File | FASTA format alignment file for all the core, non-redundant, non-recombinant sequences, joined end to end |
Pan_matrix.xlsx | File | CD-HIT output, gnerated in amtrix format such that each row represent Cluster number and each column represent each sample. Background color of each row selected such that green represent ω >= 1,blue represent ω = 0, red represent ω < 1. Those without color are non-core/recombinant clusters |
RDP_error.tsv | File | Cluster numbers for which Recombination could not be detected because there were less than three unique sequences in the cluster |
Selection_table.tsv | File | A matrix with ω value for the clusters |
This project is licensed under the GNU GPLv3 License. See the LICENSE file for details.
This README file includes more detailed instructions for installation and usage, and it ensures that the content is well-organized and easy to follow.