FMSI is a memory-efficient tool for querying
Note that currently the tool is still in development and especially the set operations are quite time-consuming.
It is provided under the MIT license (see LICENSE file).
If you use FMSI in your research, please cite the following.
Ondřej Sladký, Pavel Veselý, and Karel Břinda: Function-Assigned Masked Superstrings as a Versatile and Compact Data Type for k-Mer Sets. bioRxiv 2024.03.06.583483, 2024. https://doi.org/10.1101/2024.03.06.583483
@article {sladky2024-f-masked-superstrings,
author = {Ond{\v r}ej Sladk{\'y} and Pavel Vesel{\'y} and Karel B{\v r}inda},
title = {Function-Assigned Masked Superstrings as a Versatile and Compact Data Type for 𝑘-Mer Sets},
elocation-id = {2024.03.06.583483},
year = {2024},
doi = {10.1101/2024.03.06.583483},
publisher = {Cold Spring Harbor Laboratory},
URL = {https://www.biorxiv.org/content/early/2024/03/11/2024.03.06.583483},
eprint = {https://www.biorxiv.org/content/early/2024/03/11/2024.03.06.583483.full.pdf},
journal = {bioRxiv}
}
Ondřej Sladký, Pavel Veselý, and Karel Břinda: Masked superstrings as a unified framework for textual k-mer set representations. bioRxiv 2023.02.01.526717, 2023. https://doi.org/10.1101/2023.02.01.526717
@article{sladky2023-masked-superstrings,
title = { Masked superstrings as a unified framework for textual $k$-mer set representations },
author = { Sladk{\'y}, Ond{\v r}ej and Vesel{\'y}, Pavel and B{\v r}inda, Karel },
journal = { bioRxiv },
volume = { 2023.02.01.526717 },
year = { 2023 },
doi = { 10.1101/2023.02.01.526717 }
}
First clone the repo and its dependencies:
git clone --recursive git@github.com:OndrejSladky/fmsi.git
Compile the program by running make
.
If you wish to perform set operations or answer membership queries without the need to understand the
details of the
- Compute a masked superstring.
- This can be done by KmerCamel🐫; simply run
kmercamel -c -k 31 -p fasta.fa -o ms.fa
(with appropriate values for-k
and-p
). - If you obtained the masked superstring in a different way, make sure it minimizes the number of ones; if you're unsure, you can use
kmercamel optimize -c -a zeros -k 31 -p ms_more_ones.fa -o ms.fa
. No need to optimize superstrings directly computed by KmerCamel🐫.
- This can be done by KmerCamel🐫; simply run
- Index the masked superstring with
./fmsi index -p ms.fa
. - Perform the set operations you wish with
./fmsi [operation] -k 31 -p ms1.fa -p ms2.fa -r ms.fa
. Possible operations (use the names instead of[operation]
) are:union
to compute the union.inter
to compute the intersectiondiff
to compute the set differencesymdiff
to compute the symmetric difference
- Query the index with
./fmsi query -p ms.fa -q query.fa -k 31
. - To get back the underlying masked superstring, use
./fmsi export -p ms.fa
.
If you use FMSI this way, it ensures that operations in any order and queries to any index are computed correctly,
while keeping the memory usage for queries as low as possible. Furthermore, exported
The only downside to this approach is that each set operation uses compaction, which is the most time- and memory- consuming part of the process, which in some use cases might cause slowdowns which are not necessary. If this is your case, you probably want to stick to the advanced usage, managing the functions and building block methods yourself.
If you wish to manage the operations yourself, the workflow is quite similar to the basic usage, with the following changes:
- Underlying
$f$ -MS concatenation can be done with./fmsi merge
. Details on which$f$ should be used is described in Chapter 4 of the paper. - Compaction and change back to the or-masked superstring can be done with
./fmsi compact
. - If you query the index with different function than or, use the
-f
argument. The same applies to compaction.
To run the tool, run ./fmsi [command]
The recognized commands are:
-
index
Creates a BWT based index of the given masked superstring. -
query
Queries a$k$ -mer against an index. -
union
Performs union on two indexes. Expects or-MS. -
inter
Performs intersection on two indexes. Expects or-MS. -
diff
Performs set difference on two indexes. Expects or-MS. -
symdiff
Performs symmetric difference on two indexes. Expects or-MS. -
merge
Merge several indexes. -
compact
Compacts an index. -
export
Export the underlying masked superstring. -
clean
Cleans the files stored for index. -
-v
Prints the version of the program. -
-h
Prints the help.
Each command has its own set of arguments, which can be displayed by running ./fmsi [command] -h
.
To run the associated tests, simply run make test
.
Ondřej Sladký - sladky@iuuk.mff.cuni.cz