Pipeline for comparative analysis of a potentially unlimited number of RepeatExplorer runs
The pipeline clusters consensus sequences and, optionally, 'other' contigs from a potentially unlimited number of RepeatExplorer runs, based on sequence homology detected with BLASTn.
Main steps of the pipeline:
- Collecting all sequences for analysis (consensuses, references (if present) and, optionally, contigs) into one FASTA file
- Splitting the FASTA file into chunks for parallel BLAST
- All-to-all BLAST of the chunks to build the connectivity table
- Dropping junk alignments using K-means or agglomerative clustering
- Building a graph in memory using the QuickUnion algorithm
- Extracting all connected components of the graph
- Filtering connected components based on their content
- Writing the selected superclusters to separate FASTA files
- Filtering contigs with BLAST if necessary
- Generating a report with information about each selected supercluster
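The graph-building and component-extraction steps can be illustrated with a small union-find structure. This is only a sketch of the general weighted quick-union technique with path compression, not the code used inside REcomp2; the class name `QuickUnion`, the helper `connected_components` and the sequence IDs in the usage note are assumptions made for illustration.

```python
from collections import defaultdict


class QuickUnion:
    """Weighted quick-union with path compression over hashable items."""

    def __init__(self):
        self.parent = {}
        self.size = {}

    def find(self, item):
        # Register unseen items as their own root.
        if item not in self.parent:
            self.parent[item] = item
            self.size[item] = 1
        # Walk up to the root of the tree that contains the item.
        root = item
        while self.parent[root] != root:
            root = self.parent[root]
        # Path compression: point every visited node straight at the root.
        while self.parent[item] != root:
            self.parent[item], item = root, self.parent[item]
        return root

    def union(self, a, b):
        root_a, root_b = self.find(a), self.find(b)
        if root_a == root_b:
            return
        # Attach the smaller tree under the larger one.
        if self.size[root_a] < self.size[root_b]:
            root_a, root_b = root_b, root_a
        self.parent[root_b] = root_a
        self.size[root_a] += self.size[root_b]

    def components(self):
        # Group every known item by the root of its tree.
        groups = defaultdict(set)
        for item in list(self.parent):
            groups[self.find(item)].add(item)
        return list(groups.values())


def connected_components(pairs):
    """Return the connected components implied by (query, subject) pairs."""
    qu = QuickUnion()
    for query, subject in pairs:
        qu.union(query, subject)
    return qu.components()
```

For example, the hypothetical pairs `("p1_CL1", "p2_CL3")`, `("p2_CL3", "ref_A")` and `("p1_CL7", "p2_CL9")` yield two components: one with three sequences and one with two. Each component corresponds to a candidate supercluster in the sense used above.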
Install Anaconda with Python 3 from the official website.
Download and install BLAST+ from the official website. Make sure the path to its installation folder is present in the PATH variable.
git clone https://github.com/alermol/REcomp2.git
cd REcomp2/
conda env create -f environment.yaml
conda activate recomp
cd REcomp/
./test.py
If the test run completes successfully, the pipeline is ready to use.
The pipeline was tested on Ubuntu 18.04.3.
The help message and available options can be accessed using
./REcomp.py -h # short option
./REcomp.py --help # long option
which gives the following output
usage: REcomp.py [-h] [-v] [-r REF] [-l] [-c CPU] [-io] [-ir]
[--evalue EVALUE] [--low-memory] [-ss {blastn,megablast}]
path prefix out
positional arguments:
path path(s) to RE results (top level)
prefix prefix(es) for each paths
out path to output directory
optional arguments:
-h, --help show this help message and exit
-v, --version show version
-r REF, --references REF
path to fasta with references repeats
-l save logfile in output directory
-c CPU number of CPU to use
-io, --include-other include 'other' contigs and clusters in analysis (default: False)
-ir, --include-ribosomal
include rDNA clusters (rank 4) in analysis (default: False)
--evalue EVALUE evalue threshold for alignments for supercluster assembly (default: 1e-05)
--low-memory use small amount of RAM for 'all to all' blast by using small chunk size (1000)
but it can take much time (default chunk size: 10000)
-ss {blastn,megablast}, -superclusters-search {blastn,megablast}
alignments for union of sequences in supercluster can be performed either
blastn or megablast (default): blastn is slower and required more RAM
but more sensitive
The details of each option are given below:
Expects: STRING
Default: None
The path argument is a quoted string containing space-separated paths to the top level of RepeatExplorer results. Paths must not be repeated. This argument is required.
Example
"~/RE_result1 ~/RE_result2"
Expects: STRING
Default: None
The prefix argument is a quoted string containing space-separated prefixes, one for each path in the path argument. Each prefix must correspond to exactly one path. Prefixes must not be repeated. Short prefixes are recommended. This argument is required.
Example
"p1 p2"
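The rules for the path and prefix arguments (space-separated, no repeats, one prefix per path) can be checked with a few lines of Python. The following is a sketch of such a check under those stated rules, not the validation code of REcomp2; the function name `pair_paths_with_prefixes` is hypothetical.

```python
def pair_paths_with_prefixes(path_arg, prefix_arg):
    """Split both quoted strings and map each prefix to its path."""
    paths = path_arg.split()
    prefixes = prefix_arg.split()
    # Every path needs exactly one prefix.
    if len(paths) != len(prefixes):
        raise ValueError("number of prefixes must equal number of paths")
    # Neither paths nor prefixes may be repeated.
    if len(set(paths)) != len(paths):
        raise ValueError("paths must not be repeated")
    if len(set(prefixes)) != len(prefixes):
        raise ValueError("prefixes must not be repeated")
    # Pair by position: first prefix goes with first path, and so on.
    return dict(zip(prefixes, paths))
```

With the example values above, `pair_paths_with_prefixes("~/RE_result1 ~/RE_result2", "p1 p2")` maps `p1` to `~/RE_result1` and `p2` to `~/RE_result2`.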
Expects: STRING (to be used as a path to directory)
Default: None
The out argument is the last required argument for REcomp2. It points to the directory where results are saved. If the directory does not exist, it will be created.
Expects: STRING (to be used as a path to file)
Default: None
The -r (--references) option takes a file containing reference sequences in FASTA format. Files with multiple sequences are also valid. A multi-FASTA file must not contain duplicate record IDs.
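Since duplicate record IDs are not allowed in the references file, it can be useful to check a file before a run. The sketch below assumes that a record ID is the first whitespace-delimited token after `>` on a header line, which is the usual FASTA convention; the function name `duplicate_fasta_ids` is hypothetical and is not part of REcomp2.

```python
from collections import Counter


def duplicate_fasta_ids(lines):
    """Return a sorted list of record IDs that occur more than once."""
    ids = Counter(
        line[1:].split()[0]  # first token after '>' is taken as the ID
        for line in lines
        if line.startswith(">") and line[1:].split()
    )
    return sorted(record_id for record_id, count in ids.items() if count > 1)
```

The function accepts any iterable of lines, so an open file handle can be passed directly, for example `with open("references.fasta") as handle: print(duplicate_fasta_ids(handle))`. An empty list means that all IDs are unique.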
Expects: None
Default: False
If the -l option is set, REcomp.log will be saved in the output directory.
Expects: INTEGER
Default: All system cores
Number of CPU cores used by REcomp2 (-c).
Expects: None
Default: False
By default, 'other' clusters and contigs are not included in the analysis. If -io (--include-other) is set, contigs from 'other' clusters are included, which can dramatically increase run time, especially if --low-memory (see below) is turned on.
Expects: None
Default: False
The -ir (--include-ribosomal) option includes the rDNA consensuses (rank 4) in the analysis.
Expects: FLOAT >= 0.0
Default: 1e-05
E-value threshold (--evalue) for BLAST alignments. Alignments with an E-value greater than this threshold are discarded.
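The effect of the threshold can be shown on BLAST tabular output. The sketch below assumes the default 12-column tabular format of BLAST+ (`-outfmt 6`), in which the E-value is the 11th column; whether REcomp2 reads this exact format internally is an assumption, and the function name `filter_by_evalue` is hypothetical.

```python
def filter_by_evalue(tabular_lines, threshold=1e-05):
    """Keep (query, subject, evalue) for hits with evalue <= threshold."""
    kept = []
    for line in tabular_lines:
        line = line.rstrip("\n")
        # Skip blank lines and comment lines.
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        evalue = float(fields[10])  # 11th column in default -outfmt 6
        # Hits above the threshold are discarded.
        if evalue <= threshold:
            kept.append((fields[0], fields[1], evalue))
    return kept
```

With the default threshold, a hit with an E-value of 1e-50 is kept, while a hit with an E-value of 0.5 is dropped.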
Expects: None
Default: False
If --low-memory is set, the FASTA file is split into smaller chunks (chunk size 1000 instead of the default 10000). This reduces memory use during the 'all to all' BLAST step, at the cost of a longer run time.
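The chunking idea can be sketched as follows. The sketch assumes that the chunk size counts sequences per chunk; the help text does not state the unit, so this is an assumption, and the function name `split_fasta_records` is hypothetical rather than taken from REcomp2.

```python
def split_fasta_records(lines, chunk_size=10000):
    """Yield lists of (header, sequence) records, at most chunk_size each."""
    chunk = []
    header, seq = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                chunk.append((header, "".join(seq)))
                if len(chunk) == chunk_size:
                    yield chunk
                    chunk = []
            header, seq = line[1:], []
        else:
            seq.append(line)
    # Flush the last record and the last, possibly partial, chunk.
    if header is not None:
        chunk.append((header, "".join(seq)))
    if chunk:
        yield chunk
```

Calling `split_fasta_records(handle, chunk_size=1000)` mimics the low-memory setting. Smaller chunks mean more chunk pairs to align, which fits the help text's warning that this mode can take much longer.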
Expects: STRING (blastn or megablast)
Default: megablast
With -ss, alignments for joining sequences into superclusters can be performed with either blastn or megablast. blastn is slower and requires more RAM but is more sensitive.
The -v (--version) option prints the version information of REcomp2.
# find all superclusters using default parameters
./REcomp.py '~/RE_result1 ~/RE_result2' 'p1 p2' ~/REcomp2_output -r ~/references.fasta
# find all superclusters using default parameters and save a log file
./REcomp.py '~/RE_result1 ~/RE_result2' 'p1 p2' ~/REcomp2_output -r ~/references.fasta -l
# find all superclusters including 'other' using default parameters
./REcomp.py '~/RE_result1 ~/RE_result2' 'p1 p2' ~/REcomp2_output -r ~/references.fasta -io
# find all superclusters including 'other' using less memory
./REcomp.py '~/RE_result1 ~/RE_result2' 'p1 p2' ~/REcomp2_output -r ~/references.fasta -io --low-memory