detREC_program: A C++ repository from qianfeng2

A Program to Detect Recombinants From Unaligned Sequences

About

This program is a novel approach for detecting recombinant sequences and corresponding statistical support values from unaligned biological sequences. This framework develops on the basis of the paritial alignment results from jumping hidden markov model (JHMM, or mosaic), after that, by dividing them into multiple equal-length triples, on which we use a new distance-based procedure to identify recombinant from each triple. Statistical support values calculated from Bootstrap, the bigger the better, indicating the robustness of identified recombinants.

Required softwares

MAFFT used to align one sequence to another two pre_aligned sequences (https://mafft.cbrc.jp/alignment/software/)
SeqKit used to concatenate the two segments for each triple (https://bioinf.shenwei.me/seqkit/download/)
Python >=3.5

pip install -r requirements.txt

Optional softwares (only used for simulation section)

Msprime used to generate one phylogenetic tree topology
INDELible used for simulation of biological sequences when considering indel events (Fletcher et al., 2009)
Python module
Pyvolve used for simulating biological sequences given a tree topology (Spielman et al. 2015)
sklearn
statistics
csv
Snakemake for reproducible research pipeline

Required Input Files

Input fasta format biological sequences, maximum length for identifiers length is 15 (Zilversmit et al., 2013)
Patial alignment produced by JHMM (please see MZmosaic sub folder)

Creating Input File

Run Example

cd /Users/fengqian/MZmosaic

./mosaic -estimate -seq input.fasta  -rec 0 -aa -tag middle_file

delta=$(grep -o 'Gap initiation: .*$' middle_file_align.txt | cut -c17-)

epsl=$(grep -o 'Gap extension:  .*$' middle_file_align.txt | cut -c17-)

./mosaic -seq input.fasta -del $delta -eps $epsl -aa -tag output -grid 0.001 0.010 10 1

Run Example for large number of sequences (>10000 sequences)

Instead of complete and time-consuming Baum-Welch algorithm to estimate gap open and gap extension probabilities, the slightly less accurate but much faster viterbi training algorithm was used. Each iteration was run as on a high performing computing cluster (Helix)

Iterate until convergence:

Choose an initial set of parameters
Compute the Viterbi paths of all sequences
Count frequencies of events and calculate new parameters
Update -> 1)
Stop when the major parameters del and eps change by less than 1%.

The empirical Ghana pilot DBLa dataset analyzed in manuscript contain more than 17000 sequences, the detailed code for generating partial alignment results are displayed in Empirical_script sub folder.

Running recombination detection program

Required:

"<filename>": name of the file which contains your preprepared partial alignment information.
"<filename>": name of the file which contains your complete biological sequences, fasta format required.
"<filename>": name of the file which contains your list of identified recombinants and related statistical support value, csv format required.

Type in above three input parameters in order.

Output

Produces a series of files based on various stage of the implementation of recombination detection program, and places them in the directory specified by output.

temp file folder
This folder provides all the chunks containing original triple and MAFFT processed fasta files.
complement_chunks file folder
This folder provides all the equal-length triples, name of each file indicates chunk index, two adjacent segment indices, and identified bkp in this triple.
output csv file:
Each row records the chunk index in partial alignment result, target, db1 and db2 are three sequences ID for each triple, rec is the identified recombinant ID from this specific triple, sv is the support value. For instance:

chunk	target	db1	db2	rec	sv
2	seq3	seq5	seq2	seq3	1
6	seq7	seq8	seq1	seq8	0.58
...	...	...	...	...	...

Run Example

mkdir test

cp Test_files/input.fasta Test_files/output_align.txt Test_files/integrated_rec_det.py test

cd test

python integrated_rec_det.py output_align.txt input.fasta output.csv

Test_files sub folder, as a toy example, provides a test input.fasta and all the middle and final output files.

Reference

Zilversmit, M. M., Chase, E. K., Chen, D. S., Awadalla, P., Day, K. P., & McVean, G. (2013). Hypervariable antigen genes in malaria have ancient roots. BMC evolutionary biology, 13(1), 110.
Fletcher, W., & Yang, Z. (2009). INDELible: a flexible simulator of biological sequence evolution. Molecular biology and evolution, 26(8), 1879-1888.
Spielman, S. J., & Wilke, C. O. (2015). Pyvolve: a flexible Python module for simulating sequences along phylogenies. PloS one, 10(9), e0139047.

qianfeng2/detREC_program

A Program to Detect Recombinants From Unaligned Sequences

About

Required softwares

Optional softwares (only used for simulation section)

Required Input Files

Creating Input File

Run Example

Run Example for large number of sequences (>10000 sequences)

Running recombination detection program

Required:

Output

Run Example

Reference