compact-superstring

Use a greedy approximation algorithm to find the shortest common superstring of given strings in succinct space. The algorithm is described in the arXiv paper. The implementation assumes a byte alphabet.

Build/Runtime Requirements

Reasonably new compilers for C and C++, e.g. GCC 6 or Clang 3.7. C++14 support is required.
GNU gengetopt 2.22.6.

On Linux also the following libraries are required to build and run a tool to verify the constructed superstring:

Build Requirements

CMake
Boost
Python2 to build the verification tool for Linux.

Building

Short version

git clone https://github.com/tsnorri/compact-superstring.git
cd compact-superstring
git submodule update --init
touch local.mk
make -j4 BUILD_STYLE=release

Long version

Clone the repository with git clone https://github.com/tsnorri/compact-superstring.git.
Change the working directory with cd compact-superstring.
Run git submodule update --init. This clones the missing submodules and updates their working tree.
Edit local.mk in the repository root to override build variables. Useful variables include CC and CXX for C and C++ compilers respectively, LOCAL_CXXFLAGS and LOCAL_LDFLAGS for C++ and linker flags respectively and BOOST_IOSTREAMS_LIB for the command line options to link Boost's Iostreams library. See common.mk for additional variables.
Run make with a suitable numer of parallel jobs and the variable BUILD_STYLE defined, e.g. make -j4 BUILD_STYLE=release

Useful make targets include:

all: Build everything
clean: Remove build products except for dependencies (in the `lib` folder).
clean-all: Remove all build products.

Running

Please see run.sh for examples.

The tool tribble/find-superstring/find-superstring takes a FASTA file as input and generates an index. The index may then be used to generate the superstring.

The tool tribble/verify-superstring/verify-superstring takes the superstring generated by find-superstring as input and builds another index. This index may then be used to check that all the reads in the original FASTA input file are substrings of the superstring.

See tribble/find-superstring/find-superstring --help and tribble/verify-superstring/verify-superstring --help for command line options and examples.

Disclaimer

The implementation differs from the one described in the arXiv paper in the preprocessing stage where it sorts the input strings and removes duplicates. For simplicity, we used the C++ standard library sorting. If there is a huge number of duplicates in the data, this might take O(n log n) bits of space. If your dataset contains a huge number of duplicates, we suggest you remove those before running the algorithm.

algbio/GreedySuperstring