/br-index-mems

MEM finding on texts and graphs using bidirectional r-index

Primary LanguageC++

br-index-mems

News

Latest version of this tool can be found in https://github.com/algbio/efg-mems.

This repository contains a version described in WABI 2023 paper by Rizzo, Cáceres, Mäkinen (see below).

This is fork of Yuma Arakawa's br-index, adding support for MEM finding:

  • Support for finding maximal exact matches (MEMs) between a set of query patterns and a text
  • Support for finding MEMS between a set of query patterns and an elastic founder graph

About the root repository

The root repository provides the bi-directional r-index (br-index).

The r-index is the compressed text index which supports efficient count(P) and locate(P). Its uses O(r) words of space, where r is the number of equal-letter runs in BWT of the text.

The br-index is achieved by using the mechanism of the r-index but adding new structures. It allows for bi-directional extension of patterns, left-extension and right-extension. Also, you can locate all the occurrences of the current pattern at any step of the search.

System Requirements

  • This project is based on sdsl-lite library. Install sdsl-lite beforehand and modify variables SDSL_INCLUDE and SDSL_LIB in CMakeLists.txt.

  • This project has been tested under gcc 4.8.5 and gcc 7.5.0.

How to Use

Firstly, clone the repository. Since a submodule is used (iutest), recursive cloning is necessary.

git clone --recursive https://github.com/algbio/br-index-mems.git

In order to build, execute following commands: (This project is using CMake)

mkdir build
cd build
cmake ..
make

8 executables will be created in the build directory.

bri-build
Builds the br-index on the input text file.
bri-locate
Locates the occurrences of the given pattern using the index. Provide a pattern file in the Pizza&Chili format. You can give an option "-m (number)" for the number of mismatched characters allowed (0 by default).
bri-count
Counts the number of the occurrences of the given pattern using the index. Its usage is same as bri-locate.
bri-seedex
Applies the seed-and-extend approach to the given pattern. Exactly matches the core region and extends with some mismatches.
bri-space
Shows the statistics of the text and the breakdown of the index space usage.
bri-mem
Finds MEMs between a collection of patterns and a text.
efg-mems
Finds MEMs between a collection of patterns and an elastic founder graph.
run_tests
runs unit tests.

You can run unit tests by

make test-bri

For MEM finding, you can test the following:

mkdir inputs
mkdir outputs
cp ../patterns.txt inputs/
cp ../text.txt inputs/
./br-build inputs/patterns.txt
./br-build inputs/text.txt
./bri-mem -k 4 -o outputs/MEMs.txt inputs/text.txt.bri inputs/patterns.txt.bri
cat outputs/MEMs.txt
This should print MEMs x,i,d s.t. P[i..i+d-1]=T[x..x+d-1] and d>=4.
In our case (note 0-based indexing):
15,2,6
65,2,6
31,18,4
25,1,10
12,3,4
60,1,6

Versions (from root repository)

br_index.hpp (default)
The simple implementation of br-index used in the experiments. Only the variables necessary for locate (j,d,len) are maintained, which are sufficient to compute locate.
br_index_nplcp.hpp
The implementation without PLCP. (p,j,d,len) are maintained. It computes locate by calculating p'=LF^d(p) and comparing p' with [s, e]. Larger than the normal one while computing locate is faster when occ is large compared to |P|.
br_index_naive.hpp (old)
The naive implementation of br-index. All the variables p,j,d,pR,jR,dR,len are maintained during the search. Not space-efficient, implemented mainly for the educational purpose and the possible future use. (It's not updated now, so it doesn't function)

Citation

Cite the following paper for br-index:

  • Arakawa, Y., Navarro, G., & Sadakane, K. (2022). Bi-Directional r-Indexes. In 33rd Annual Symposium on Combinatorial Pattern Matching (CPM 2022). Schloss Dagstuhl-Leibniz-Zentrum für Informatik.

It is more desirable to cite the following papers in addition, which are the original papers of the r-index:

  • Gagie, T., Navarro, G., & Prezza, N. (2018). Optimal-time text indexing in BWT-runs bounded space. In Proceedings of the Twenty-Ninth Annual ACM-SIAM Symposium on Discrete Algorithms (pp. 1459-1477). Society for Industrial and Applied Mathematics.
  • Gagie, T., Navarro, G., & Prezza, N. (2020). Fully functional suffix trees and optimal text searching in BWT-runs bounded space. Journal of the ACM (JACM), 67(1), 1-54.

MEM finding is analogous to Algorithm 11.3 at page 226 in

  • V. Mäkinen, D. Belazzougui, F. Cunial, A. I. Tomescu: Genome-Scale Algorithm Design. Cambridge University Press, 2015.
  • See also: D. Belazzougui, F. Cunial, J. Kärkkäinen, V. Mäkinen: Linear-time String Indexing and Analysis in Small Space. ACM Trans. Algorithms 16(2): 17:1-17:54 (2020)

MEM finding on EFGs is described in

  • Nicola Rizzo, Manuel Cáceres, Veli Mäkinen: Finding Maximal Exact Matches in Graphs. In Proc. WABI 2023, LIPIcs 273, Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 10:1-10:17, 2023.
  • Current version implements only MEM finding on paths of length at most 3 nodes, so if k>min_{paths P of 3 nodes} |label(P)|, some MEMs may not be reported