/BWTIL_mod

Primary LanguageJupyter Notebook

BWTIL : the BWT Text-indexing Library

Authors: Nicola Prezza, Nicola Gigante, and Alberto Policriti. mail: nicolapr@gmail.com

Brief introduction

The BWTIL library offers a set of tools, classes and functions designed to facilitate operations that involve manipulating the Burrows-Wheeler text transform (example: computing/inverting the BWT or checking its correctness) and indexes based on the BWT. The library is specifically designed to deal with large files (which means up to 2^64 characters) and implements efficient algorithms and data structures to offer good time-and-space performances. BWTIL comes with some sample programs and classes, among which you can find static and dynamic bitvectors, Huffman-compressed dynamic strings, partial sums, Huffman trees, static wavelet trees, the dB-hash data structure (ISAAC2014 paper: http://link.springer.com/chapter/10.1007%2F978-3-319-13075-0_13), a succinct wavelet-tree based FM index, and a novel BWT construction algorithm (LATA2015 paper: http://link.springer.com/chapter/10.1007/978-3-319-15579-1_46). The available executables (to date) are:

Download

Since BWTIL includes extern git repositories as submodules, clone it using the --recursive option:

git clone --recursive https://github.com/nicolaprezza/BWTIL

Compile

The library has been extensively tested under linux using gcc 4.8.2 and clang++ 3.5. We use cmake to generate the Makefile:

To compile, firstly create and enter the bin/ directory

mkdir bin; cd bin

Then, launch cmake as (default build type is release):

cmake ..

Finally, build the executables:

make

The above command creates the executables in the bin directory.

Input formats

  • plain text files: ASCII-coded. However, the byte 0x0 must NOT appear inside the text since the algorithms use 0x0 as text terminator.
  • bwt files: ASCII-coded, with a UNIQUE 0x0 byte (terminator character) appearing somewhere inside the text. Be aware that, if the input bwt file is malformed, the programs will fail with a error message.
  • suffix array files: each SA address is stored as a 64-bit (8 byte) integer. Despite the fact that this coding may not be optimal for small texts, you can compress the SA files to reduce its size.