A fast algorithm for the all-pairs suffix–prefix problem [Theoretical Computer Science 2017]

Jihyuk Lim and Kunsoo Park

Introduction

All-pairs suffix-prefix matching is an important part of DNA sequence assembly where it is the most time-consuming part of the whole assembly. Although there are algorithms for all-pairs suffix-prefix matching which are optimal in the asymptotic time complexity, they are slower than SOF and Readjoiner which are state-of-the-art algorithms used in practice. In this paper we present an algorithm for all-pairs suffix-prefix matching that uses a simple data structure for storing input strings and advanced algorithmic techniques for matching, which together lead to fast running time in practice. Our algorithm is much faster than SOF and Readjoiner in real datasets and random datasets.

Datasets

We compare the performances of SOF, Readjoiner, and our algorithm for the whole process of APSP matching. The labels of Our1 and Our2 mean our algorithm with option output = 2 and output = 3, respectively. For SOF, option −o 2 is used. Our2, SOF with −o 2 and Readjoiner return the same output, which includes all overlaps for each pair of input strings. Our1 solves the problem of APSP matching as it is defined (i.e., it returns the longest overlap for each pair of input strings). In our algorithm we set c1 to 16 and c2 to 8 in all experiments.

We used two types of datasets which are real and random. The five real datasets are the complete EST databases of Citrus clementina, Citrus sinensis, Citrus trifoliata, C. elegans, and Atta cephaloates.

Citrus clementina (Original link)
Citrus sinensis (Original link)
Citrus trifoliata (Original link)
C. elegans (Original link)
Atta cephaloates (Original link)

The random datasets are generated by a program [4] that gives k strings such that the lengths of the strings follow a normal distribution with mean z and standard deviation σ and the characters of the strings follow a uniform distribution over the alphabet, where k, z and σ are parameters given by the user. The alphabet is again {A,C, G, T}. We generated two datasets rnd1 and rnd2, where rnd1 has 300000, 1000, and 150 as k, z, and σ, respectively, and rnd2 has 1000000, 500, and 100.

Source code information

This source code is an implementation of [1], a new algorithm for APSP matching.

The source code is implemented based on SOF [2]. (SOF is available at http://confluence.qu.edu.qa/download/attachments/9240580/Prefix.tgz)

compile

make

run

To run a test with datasets/all_ests.plain, output type output=2, the number of threads p=1, and overlap minimum om=15 and to write the results to output_all_ests.

./FastAPSP datasets/all_ests.plain -output 2 -p 1 -om 15 > output_all_ests

references

[1] Jihyuk Lim and Kunsoo Park, Algorithm Engineering for All-Pairs Suffix-Prefix Matching, SEA 2017.

[2] M.H. Rachid and Q. Malluhi, A Practical and Scalable Tool to Find Overlaps betweeen Sequences, BioMed Res. Int. 2015, 1-12, 2015

[3] Jihyuk Lim and Kunsoo Park, A Fast Algorithm for the All-Pairs Suffix-Prefix Problem, Theoretical Computer Science, "https://doi.org/10.1016/j.tcs.2017.07.013"

[4] http://theory.snu.ac.kr/?wpdmpro=random_gen