A repository for generating strobemers and evaluation.
First off, this is a prototype implementation created for the analysis in the preprint describing strobemers. As Python is inefficient with string manipulation, any real implementation and usage of strobemers should probably happen in a low level language. However, the Python code is sufficient as proof of concept.
Strobemers are currently being implemented in compiled languages
The repository consists of a library and a tool StrobeMap
. The library indexing.py
contains functions and generators for creating the datastructures used in the evaluation of the preprint. The tool StrobeMap
is a program which roughly has the same interface as MUMmer
. StrobeMap
takes a reference and queries file in fasta or fastq format. It produces NAMs (Non-overlapping Approximate Matches) between the queries and references and outputs them in a format simular to nucmer/MUMmer. See preprint for definition of NAMs.
The indexing.py
module located in the modules
folder contains functions for generating k-mers, spaced k-mers, minimizers, and strobemers (minstrobes, hybridstrobes, and randstrobes) of order 2 and 3. For randstrobes, there are two ways to create them. The first way is with the function randstrobes
, which takes a string, k-mer size, and upper and lower window limits and returns a dictionary with positions of the strobes as keys and the hash value of the randstrobe sequence (strings) as values. For example
from modules import indexing
all_mers = defaultdict(list)
for (p1,p2,p3), h in indexing.randstrobes(seq, k_size, w_min, w_max, order = 3).items():
# all_mers is a dictionary with hash values as keys and
# a list with position-tuples of where the strobemer is sampled from
all_mers[h].append( (p1,p2,p3) )
Functions minstrobes
and hybridstrobes
have the same interface.
The second way is to call randstrobes_iter
which is a generator. Similarly to randstrobes
, randstrobes_iter
takes a string, k-mer size, and upper and lower window size, but instead yields randstrobes from the sequence and is not as memmory requiring as the randstrobes
function which store and returns all the strobes in a dictionary. randstrobes_iter
generating randpers of order 2 can be used as follows
from modules import indexing
all_mers = defaultdict(list)
for (p1,p2), s in indexing.randstrobes_iter(seq, k_size, w_min, w_max, order = 2, buffer_size = 1000000):
all_mers[s].append( (p1,p2) )
Functions minstrobes_iter
and hybridstrobes_iter
have the same interface.
StrobeMap
implements order 2 and 3 hybridstrobes (default), randstrobes, minstrobes, as well as kmers. The tool produces NAMs (Non-overlapping Approximate Matches; see explanation in preprint) for both strobemers and kmers. Test data is found in the folder data
in this repository.
Here are some example uses:
# Generate hybridstrobe matches (hybridstrobe parametrization (2,15,20,70))
# between ONT SIRV reads and the true reference sequences
./StrobeMap --queries data/sirv_transcripts.fasta \
--references data/ONT_sirv_cDNA_seqs.fasta \
--outfolder strobemer_output/ --k 15
--strobe_w_min_offset 20 --strobe_w_max_offset 70
# Generate kmer matches (k=30)
# between ONT SIRV reads and the true reference sequences
./StrobeMap --queries data/sirv_transcripts.fasta \
--references data/ONT_sirv_cDNA_seqs.fasta \
--outfolder kmer_output/ --k 30 --kmer_index
# Reads vs reads matching using randstrobes
./StrobeMap --queries data/ONT_sirv_cDNA_seqs.fasta \
--references data/ONT_sirv_cDNA_seqs.fasta \
--outfolder strobemer_output/ --k 15 \
--strobe_w_min_offset 20 --strobe_w_max_offset 70 \
--randstrobe_index
Minstrobes has the same parameters as hybridstrobes and randstrobes but are invoked with parameter --minstrobe_index
The output is a file matches.tsv
in the output folder. You can se a custom outfile name with the parameter --prefix
.
Output format is a tab separated file on the same format as MUMmer, with identical fields except the last one which is approximate reference sequence match length instead of what MUMmer produce:
>query_accession
ref_id ref_pos query_pos match_length_on_reference
Small example output from aligning sirv reads to transcripts (from the commands above) which also highlights the stobemers strength compared to kmers. While kmers can give a more nuanced differentiation (compare read hits to SIRV606
and SIRV616
) both the sequences are good candidates for downstream processing. In this small example, the strobemers produce fewer hits/less output needed for post clustering of matches, e.g., for downstream clustering/alignment/mapping. Notice that randstobe hit positions are currently not deterministic due to hash seed is set at each new pyhon instantiation. I will fix the hash seed in future implementations.
Randstrobes (2,15,20,70)
>41:650|d00e6247-9de6-485c-9b44-806023c51f13
SIRV606 35 92 487
SIRV616 35 92 473
>56:954|a23755a1-d138-489e-8efb-f119e679daf4
SIRV509 3 3 515
SIRV509 520 529 214
SIRV509 762 767 121
>106:777|0f79c12f-efed-4548-8fcc-49657f97a126
SIRV404 53 131 535
kmers (k=30)
>41:650|d00e6247-9de6-485c-9b44-806023c51f13
SIRV606 33 90 46
SIRV606 92 150 125
SIRV606 219 275 81
SIRV606 349 408 70
SIRV606 420 479 47
SIRV606 481 540 42
SIRV616 33 90 46
SIRV616 92 150 125
SIRV616 219 275 81
SIRV616 349 408 60
SIRV616 409 482 44
SIRV616 467 540 42
>56:954|a23755a1-d138-489e-8efb-f119e679daf4
SIRV509 68 72 141
SIRV509 230 233 100
SIRV509 331 335 105
SIRV509 435 442 40
SIRV509 475 483 36
SIRV509 579 585 41
SIRV509 621 627 46
SIRV509 695 701 44
SIRV509 812 815 53
>106:777|0f79c12f-efed-4548-8fcc-49657f97a126
SIRV404 53 131 58
SIRV404 128 208 127
SIRV404 283 364 30
SIRV404 422 494 142
Kristoffer Sahlin, Strobemers: an alternative to k-mers for sequence comparison, bioRxiv 2021.01.28.428549; doi: https://doi.org/10.1101/2021.01.28.428549
Preprint found here