Use a greedy approximation algorithm to find the shortest common superstring of given strings in succinct space. The algorithm is described in the arXiv paper. The implementation assumes a byte alphabet.
- Reasonably new compilers for C and C++, e.g. GCC 6 or Clang 3.7. C++14 support is required.
- GNU gengetopt 2.22.6.
On Linux also the following libraries are required to build and run a tool to verify the constructed superstring:
git clone https://github.com/tsnorri/compact-superstring.git
cd compact-superstring
git submodule update --init
touch local.mk
make -j4 BUILD_STYLE=release
- Clone the repository with
git clone https://github.com/tsnorri/compact-superstring.git
. - Change the working directory with
cd compact-superstring
. - Run
git submodule update --init
. This clones the missing submodules and updates their working tree. - Edit
local.mk
in the repository root to override build variables. Useful variables includeCC
andCXX
for C and C++ compilers respectively,LOCAL_CXXFLAGS
andLOCAL_LDFLAGS
for C++ and linker flags respectively andBOOST_IOSTREAMS_LIB
for the command line options to link Boost's Iostreams library. Seecommon.mk
for additional variables. - Run make with a suitable numer of parallel jobs and the variable
BUILD_STYLE
defined, e.g.make -j4 BUILD_STYLE=release
Useful make targets include:
- all
- Build everything
- clean
- Remove build products except for dependencies (in the `lib` folder).
- clean-all
- Remove all build products.
Please see run.sh
for examples.
The tool tribble/find-superstring/find-superstring
takes a FASTA file as input and generates an index. The index may then be used to generate the superstring.
The tool tribble/verify-superstring/verify-superstring
takes the superstring generated by find-superstring
as input and builds another index. This index may then be used to check that all the reads in the original FASTA input file are substrings of the superstring.
See tribble/find-superstring/find-superstring --help
and tribble/verify-superstring/verify-superstring --help
for command line options and examples.
The implementation differs from the one described in the arXiv paper in the preprocessing stage where it sorts the input strings and removes duplicates. For simplicity, we used the C++ standard library sorting. If there is a huge number of duplicates in the data, this might take O(n log n) bits of space. If your dataset contains a huge number of duplicates, we suggest you remove those before running the algorithm.