Latest version of this tool can be found in https://github.com/algbio/efg-mems.
This repository contains a version described in WABI 2023 paper by Rizzo, Cáceres, Mäkinen (see below).
This is fork of Yuma Arakawa's br-index, adding support for MEM finding:
- Support for finding maximal exact matches (MEMs) between a set of query patterns and a text
- Support for finding MEMS between a set of query patterns and an elastic founder graph
The root repository provides the bi-directional r-index (br-index).
The r-index is the compressed text index which supports efficient count(P) and locate(P). Its uses O(r) words of space, where r is the number of equal-letter runs in BWT of the text.
The br-index is achieved by using the mechanism of the r-index but adding new structures. It allows for bi-directional extension of patterns, left-extension and right-extension. Also, you can locate all the occurrences of the current pattern at any step of the search.
-
This project is based on sdsl-lite library. Install sdsl-lite beforehand and modify variables SDSL_INCLUDE and SDSL_LIB in CMakeLists.txt.
-
This project has been tested under gcc 4.8.5 and gcc 7.5.0.
Firstly, clone the repository. Since a submodule is used (iutest), recursive cloning is necessary.
git clone --recursive https://github.com/algbio/br-index-mems.git
In order to build, execute following commands: (This project is using CMake)
mkdir build
cd build
cmake ..
make
8 executables will be created in the build directory.
- bri-build
- Builds the br-index on the input text file.
- bri-locate
- Locates the occurrences of the given pattern using the index. Provide a pattern file in the Pizza&Chili format. You can give an option "-m (number)" for the number of mismatched characters allowed (0 by default).
- bri-count
- Counts the number of the occurrences of the given pattern using the index. Its usage is same as bri-locate.
- bri-seedex
- Applies the seed-and-extend approach to the given pattern. Exactly matches the core region and extends with some mismatches.
- bri-space
- Shows the statistics of the text and the breakdown of the index space usage.
- bri-mem
- Finds MEMs between a collection of patterns and a text.
- efg-mems
- Finds MEMs between a collection of patterns and an elastic founder graph.
- run_tests
- runs unit tests.
You can run unit tests by
make test-bri
For MEM finding, you can test the following:
mkdir inputs
mkdir outputs
cp ../patterns.txt inputs/
cp ../text.txt inputs/
./br-build inputs/patterns.txt
./br-build inputs/text.txt
./bri-mem -k 4 -o outputs/MEMs.txt inputs/text.txt.bri inputs/patterns.txt.bri
cat outputs/MEMs.txt
- br_index.hpp (default)
- The simple implementation of br-index used in the experiments. Only the variables necessary for locate (j,d,len) are maintained, which are sufficient to compute locate.
- br_index_nplcp.hpp
- The implementation without PLCP. (p,j,d,len) are maintained. It computes locate by calculating p'=LF^d(p) and comparing p' with [s, e]. Larger than the normal one while computing locate is faster when occ is large compared to |P|.
- br_index_naive.hpp (old)
- The naive implementation of br-index. All the variables p,j,d,pR,jR,dR,len are maintained during the search. Not space-efficient, implemented mainly for the educational purpose and the possible future use. (It's not updated now, so it doesn't function)
Cite the following paper for br-index:
- Arakawa, Y., Navarro, G., & Sadakane, K. (2022). Bi-Directional r-Indexes. In 33rd Annual Symposium on Combinatorial Pattern Matching (CPM 2022). Schloss Dagstuhl-Leibniz-Zentrum für Informatik.
It is more desirable to cite the following papers in addition, which are the original papers of the r-index:
- Gagie, T., Navarro, G., & Prezza, N. (2018). Optimal-time text indexing in BWT-runs bounded space. In Proceedings of the Twenty-Ninth Annual ACM-SIAM Symposium on Discrete Algorithms (pp. 1459-1477). Society for Industrial and Applied Mathematics.
- Gagie, T., Navarro, G., & Prezza, N. (2020). Fully functional suffix trees and optimal text searching in BWT-runs bounded space. Journal of the ACM (JACM), 67(1), 1-54.
MEM finding is analogous to Algorithm 11.3 at page 226 in
- V. Mäkinen, D. Belazzougui, F. Cunial, A. I. Tomescu: Genome-Scale Algorithm Design. Cambridge University Press, 2015.
- See also: D. Belazzougui, F. Cunial, J. Kärkkäinen, V. Mäkinen: Linear-time String Indexing and Analysis in Small Space. ACM Trans. Algorithms 16(2): 17:1-17:54 (2020)
MEM finding on EFGs is described in
- Nicola Rizzo, Manuel Cáceres, Veli Mäkinen: Finding Maximal Exact Matches in Graphs. In Proc. WABI 2023, LIPIcs 273, Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 10:1-10:17, 2023.
- Current version implements only MEM finding on paths of length at most 3 nodes, so if k>min_{paths P of 3 nodes} |label(P)|, some MEMs may not be reported