As input, the StringDecomposer algorithm takes a set of monomers (typically, alpha satellites) and a genomic segment (an assembly, an Oxford Nanopore read, or a PacBio HiFi read) that contains a tandem repeat consisting of the given monomers. StringDecomposer partitions this segment into distinct monomers, providing an accurate translation from the nucleotide alphabet into the monomer alphabet.
The recommended way to install StringDecomposer is with the conda package manager:
conda install -c bioconda stringdecomposer
Alternatively, StringDecomposer can be built and installed from source.
Requirements:
- Python 3.5+
- g++ (version 5.3.1 or higher)
The required Python packages can be installed through conda using
conda install --file requirements.txt
Local building without installation:
git clone https://github.com/ablab/stringdecomposer.git
cd stringdecomposer
make
Then, StringDecomposer is available as
bin/stringdecomposer
Installing from source:
git clone https://github.com/ablab/stringdecomposer.git
cd stringdecomposer
make install
Then, StringDecomposer is available as
stringdecomposer
Removal of StringDecomposer installed from source:
make uninstall
The following command assumes that StringDecomposer is either installed through conda or from source.
stringdecomposer ./stringdecomposer/test_data/read.fa ./stringdecomposer/test_data/DXZ1_star_monomers.fa -o ./stringdecomposer/test_data
The same result can be achieved with make test_launch (for a local build without installation) and make test_launch_install (for an installation from source or via conda). These make rules ensure correctness of StringDecomposer's output on the test dataset.
In case StringDecomposer is built locally, the command that achieves the same result is
./bin/stringdecomposer ./stringdecomposer/test_data/read.fa ./stringdecomposer/test_data/DXZ1_star_monomers.fa -o ./stringdecomposer/test_data
Results can be found in
- ./stringdecomposer/test_data/final_decomposition.tsv: final decomposition of sequences to the monomer alphabet
- ./stringdecomposer/test_data/final_decomposition_alt.tsv: final decomposition of sequences to the monomer alphabet with alternative monomers for each position
- ./stringdecomposer/test_data/final_decomposition_raw.tsv: raw decomposition with initial dynamic programming scores instead of identities
Each line in the final_decomposition.tsv file has the following form:
<read-name> <best-monomer> <start-pos> <end-pos> <identity> <second-best-monomer> <second-best-monomer-identity> <homo-best-monomer> <homo-identity> <homo-second-best-monomer> <homo-second-best-monomer-identity> <reliability>
The homo-related columns represent statistics of the best-scoring (second-best-scoring) monomer after compression of homopolymer runs in both the monomer and the target read.
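Homopolymer compression, as mentioned above, collapses each run of identical nucleotides into a single nucleotide. The following is a minimal Python sketch of that general idea; the function name `compress_homopolymers` is hypothetical, and StringDecomposer's own implementation may differ in details such as case handling.

```python
from itertools import groupby


def compress_homopolymers(sequence: str) -> str:
    """Collapse every run of identical characters into one character."""
    # groupby yields one (character, run) pair per maximal run of equal characters.
    return "".join(base for base, _run in groupby(sequence))


if __name__ == "__main__":
    print(compress_homopolymers("AAACCGTTTTA"))  # ACGTA
```

After compression, two sequences that differ only in homopolymer run lengths become identical, which is why identities computed on compressed sequences can be higher than on the raw read.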
Reliability is either ? (signifying an unreliable alignment, which can be caused by a retrotransposon insertion or a poor-quality segment of a read) or + (if the alignment is reliable).
The columns <second-best-monomer>, <second-best-monomer-identity>, <reliability>, and the homo-related columns will have the values None and -1 unless the user supplies the argument --second-best (see Synopsis below).
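The column layout described above can be read with a few lines of Python. This is a sketch under stated assumptions: the file is tab-separated (as the .tsv extension suggests), has no header line, and lists the monomers of each read in order. The names `COLUMNS`, `read_decomposition`, and `monomer_strings` are hypothetical, and the sample rows are invented for illustration only.

```python
import csv
import io

# Column names follow the line format documented above.
COLUMNS = [
    "read_name", "best_monomer", "start_pos", "end_pos", "identity",
    "second_best_monomer", "second_best_monomer_identity",
    "homo_best_monomer", "homo_identity",
    "homo_second_best_monomer", "homo_second_best_monomer_identity",
    "reliability",
]


def read_decomposition(handle):
    """Yield one dict per row; rows with an unexpected column count are skipped."""
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) == len(COLUMNS):
            yield dict(zip(COLUMNS, row))


def monomer_strings(records, reliable_only=False):
    """Group best monomers by read, keeping the order in which rows appear."""
    by_read = {}
    for rec in records:
        if reliable_only and rec["reliability"] != "+":
            continue
        by_read.setdefault(rec["read_name"], []).append(rec["best_monomer"])
    return by_read


# Invented rows, not real StringDecomposer output.
SAMPLE = (
    "read1\tA\t0\t170\t98.8\tB\t90.1\tA\t99.0\tB\t91.0\t+\n"
    "read1\tB\t171\t341\t97.5\tA\t89.9\tB\t98.1\tA\t90.2\t+\n"
    "read1\tC\t342\t400\t61.0\tA\t60.5\tC\t62.0\tA\t61.1\t?\n"
)

if __name__ == "__main__":
    records = list(read_decomposition(io.StringIO(SAMPLE)))
    print(monomer_strings(records))
    print(monomer_strings(records, reliable_only=True))
```

With a real run, `io.StringIO(SAMPLE)` would be replaced by an open file handle on final_decomposition.tsv. Filtering on the reliability column is only meaningful when the tool was run with --second-best.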
stringdecomposer [-h] [-t THREADS] [-o OUT_FILE] [-i MIN_IDENTITY] [-s SCORING] [-b BATCH_SIZE] [--second-best] sequences monomers
Required arguments:
sequences fasta-file with long reads or genomic sequences (accepts multiple sequences in one file)
monomers fasta-file with monomers
Optional arguments:
-h, --help show this help message and exit
-t THREADS, --threads THREADS number of threads (by default 1)
-o OUT_FILE, --out-file OUT_FILE output tsv-file (by default final_decomposition.tsv)
-i MIN_IDENTITY, --min-identity MIN_IDENTITY only monomer alignments with percent identity >= MIN_IDENTITY are printed (by default MIN_IDENTITY=0%)
-s SCORING, --scoring SCORING set scoring scheme for StringDecomposer in the format "insertion,deletion,mismatch,match" (by default "-1,-1,-1,1")
-b BATCH_SIZE, --batch-size BATCH_SIZE set size of the batch in parallelization (by default 5000)
--second-best StringDecomposer will generate the <second-best-monomer>, <second-best-monomer-identity>, <reliability> and homo-related columns (not recommended when running StringDecomposer on a large number of monomers)
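For scripted runs, the synopsis above can be assembled into an argument list from Python. This is only a sketch: the helper `build_command` is hypothetical, it covers a subset of the documented options (-t, -o, -i, --second-best), and the file names in the example are placeholders. It builds and prints the command without running it.

```python
import shlex


def build_command(sequences, monomers, threads=1, out_file=None,
                  min_identity=None, second_best=False,
                  executable="stringdecomposer"):
    """Return an argument list that follows the documented synopsis."""
    cmd = [executable, "-t", str(threads)]
    if out_file is not None:
        cmd += ["-o", out_file]
    if min_identity is not None:
        cmd += ["-i", str(min_identity)]
    if second_best:
        cmd.append("--second-best")
    # Positional arguments come last, in the order given in the synopsis.
    cmd += [sequences, monomers]
    return cmd


if __name__ == "__main__":
    cmd = build_command("read.fa", "monomers.fa", threads=4,
                        min_identity=70, second_best=True)
    print(" ".join(shlex.quote(arg) for arg in cmd))
    # To execute it, assuming stringdecomposer is on PATH:
    #   import subprocess
    #   subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems with file paths. For a local build without installation, `executable` would be `./bin/stringdecomposer`.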
- Remove building with Address Sanitizer by default
- git hash is disabled to enable execution outside of git repo
- CI support via github actions
- improved build and installation
- removal of unnecessary dependencies
- py module of StringDecomposer saves commit hash and has a logger
- initial StringDecomposer release
- conda support
- results of StringDecomposer monomer annotation for available centromere assemblies and ONT and HiFi reads of cen6, cen8, and cenX can be found at Figshare
The String Decomposition Problem and its Applications to Centromere Analysis and Assembly. Tatiana Dvorkina, Andrey V. Bzikadze, Pavel A. Pevzner. Bioinformatics, Volume 36, Issue Supplement_1, July 2020, Pages i93–i101; doi: https://doi.org/10.1093/bioinformatics/btaa454
In case of any issues, please use the issue tracker or email t.dvorkina@spbu.ru directly.