mKD-GBWT
mKD-GBWT(multiple KD-tree GBWT) is an external memory full-text index.
The basis of mKD-GBWT is Geometric Burrows-Wheeler Transform. The biggest improvement of mKD-GBWT is that it uses multiple KD-tree as its orthogonal range searching data structure, so it has a good I/O performance.
Install
-
First you need install SAscan [1] and LCPscan [2].
[1] https://www.cs.helsinki.fi/group/pads/SAscan.html
[2] https://www.cs.helsinki.fi/group/pads/LCPscan.html -
Download mKD-GBWT index from https://github.com/Hongweihuo-Lab/mKD-GBWT.
Modify the location of index
-
Open file sbt_util.h, modify the
#define GBWT_INDEX_POSITION "/media/软件/GBWT-Index-position/"
into location of yourself. -
Open file tools/sam_sa_lcp.cpp, modify the
#define GBWT_INDEX_POSITION "/media/软件/GBWT-Index-position/"
into location of yourself.$ cd mKD-GBWt $ make $ cd test $ make
Generate index of data
-
$ cd SAscan-??? $ cd src $ ./sascan
-
$ cd LCPscan-??? $ cd build $ ./construct_lcp
-
$ cd mKD-GBWT $ ./gen_sa_lcp
-
$ cd mKD-GBWT $ ./sam_sa_lcp $ ./build
Pattern-matching
$ cd mKD-GBWT
$ ./mygbwt <data> <disk-page-size-in-bytes> <step>
Example
the data file is `/data/english`, disk-page is 4096 bytes, step = 4
First generate the index of data
1. $ cd SAscan-???
$ cd src
$ ./sascan /data/english
2. $ cd LCPscan-???
$ cd build
$ ./construct_lcp /data/english
3. $ cd mKD-GBWT
$ ./gen_sa_lcp /data/english
4. $ cd mKD-GBWT
$ ./sam_sa_lcp /data/english 4
$ ./build /data/english 4096 4
Second, pattern mathcing
$ cd mKD-GBWT
$ ./mygbwt /data/english 4096 4
Contributors
Code
• Yuhao Zhao
Paper
H. Huo, X. Chen, Y. Zhao, X. Zhu, and J. S. Vitter, Practical Succinct Text Indexes in External Memory, Proceedings of the IEEE Data Compression Conference (DCC), Snowbird, USA, March 27 - March 30, 2018, 217–226.