/egsa

Generalized enhanced suffix array construction in external memory [CPM'13, AMB 2017]

Primary LanguageCGNU General Public License v3.0GPL-3.0

egsa tool

This software is the implementation of eGSA [1, 2], an external memory algorithm to construct generalized enhanced suffix arrays.

Given a collection of K strings, eGSA outputs the:

  • Generalized suffix array
  • LCP-array
  • Burrows-Wheeler transform (optional)

install

git clone https://github.com/felipelouza/egsa.git
cd egsa
make

run

Given a collection of K strings concatenated in a single file FILE.

./egsa [options] FILE K

Available options:

-h    this help message
-b    output BWT (ext: .bwt)
-c    check SA and LCP
-v    verbose output

Notes

  • K=0 gives all strings as input.
  • Supported extensions are: .txt, .fasta and .fastq.
  • Strings are separated per '\0' (new line) in .txt.

output

The generalized enhanced suffix array is stored in a binary file in the same directory of FILE:

FILE.K.gesa

The generalized BWT (option -b) is stored in a textual file in the same directory of FILE:

FILE.K.bwt

Notes

  • The output binary FILE.K.gesa can be read as a struct type according to the typedef in: lib/utils.h

  • Temporary files are stored in subfolders:

partition/
tmp/

options

Output the BWT inside GESA structure, that is GESA[i] = (text, suffix, lcp, bwt):

make clean
make compile BWT=1

One can inform the maximum available internal memory to be used (in MB):

make clean
make compile MEMLIMIT=10

debug

To see a more detailed execution output use:

make clean
make compile DEBUG=1

external resources

We have included the source codes of the following algorithms:

  • gSACA-K: generalized suffix array construction algorithm [3]
  • lcp-PHI: LCP-array construction algorithm given the suffix array [4]

citation

Please, if you use egsa tool in an academic setting cite the following paper [1]:

@article{Louza2017d,
	author = {Louza, Felipe A. and Telles, Guilherme P. and Hoffmann, Steve and Ciferri, Cristina D. A.},
	title={Generalized enhanced suffix array construction in external memory},
	journal={Algorithms for Molecular Biology},
	year = {2017},
	pages = {26:1--26:16},
	doi = {10.1186/s13015-017-0117-9}
}

references

[1] Louza, F. A., Telles, G. P., Hoffmann, S., Ciferri, C. D. A. (2017). Generalized enhanced suffix array construction in external memory. Algorithms for Molecular Biology 12(1): 26:1-26:16. link

[2] Louza, F. A., Telles, G. P., Ciferri, C. D. A. (2013). External Memory Generalized Suffix and LCP Arrays Construction. In Proc. CPM (pp. 201-210). link

[3] Louza, F. A., Gog, S., Telles, G. P. (2016). Inducing enhanced suffix arrays for string collections. Theor. Comput. Sci. 678: 22-39, github

[4] Kärkkäinen, J., Manzini, G., & Puglisi, S. J. (2009). Permuted Longest-Common-Prefix Array. In G. Kucherov & E. Ukkonen (Eds.), Proc. CPM (Vol. 5577, pp. 181–192).

thanks

Thanks to Fabio Garofalo, Giovanna Rosone, Giovanni Manzini, Nicola Prezza and Guilherme P. Telles by helpful suggestions and debugging.