/suffix_array

suffix array construction and searching algorithms for in-memory binary data.

Primary LanguageRustMIT LicenseMIT

Suffix array

Suffix array construction and searching algorithms for in-memory binary data.

To index plain texts, burntsushi's suffix featuring utf-8 support is a better choice.

This crate uses the Amos Wenger's C bindings to Yuta Mori's dissufsort to construct suffix array, which is the fastest known suffix array construction algorithm (SACA) running in single thread that uses merely O(1) additional workspace.

In addition, I have implemented the space-efficient parallel SACA pSACAK in a separate crate. For now, it has not been thoroughly benchmarked as well as optimized.

TODO

  • Benchmark using Pizza&Chili Corpus.

  • Rewrite the benchmarks.

  • Speed up searching by LCP array, a.k.a. construct enhanced suffix array (ESA). There are two major classes of LCP array construction algorithms (LACA): The first class produces LCP array from suffix array. AFAIK, these LACAs have to allocate additional auxiliary arrays (such as ISA, PLCP, and BWT) to construct the LCP array. The second class constructs the suffix array together with its LCP array. They are fast and space-efficient but require sophisticated modifications on kind of SACAs based on induce copying (such as sais-lite, saca-k, and divsufsort). This crate would not provide ESA support until I figure out a proper way to implement it.

  • Speed up searching by bucket pointers, inspired by this paper (which uses a trie as index, interleaved arrays and text fingerprints to improve locality for ESA searching on disk). See SuffixArray::enable_buckets().

  • Add compressed suffix array (CSA) support. CSA safcrificed the speed quite a lot to gain some space efficiency, which is not that necessary.

  • Serialization/deserialization. Enable the optional pack feature to use those APIs. This feature is based on Paul Masurel's bitpacking.

  • Rewrite suffix array construction algorithm. Currently, this crate uses dissufsort to construct all the suffix arrays.

  • Look into other one's efforts on improving libdivsufsort and multikey quick sort: 1, 2, and 3.