lcskpp: A C++ repository from fpavetic

LCSk++: Practical similarity metric for long strings

This is an implementation of the LCSk++ metric for long strings described in [1].

LCSk++ of two strings a and b calculates the longest common subsequence of two strings with the restriction that the consecutive runs of indices in both strings have length at least k, which is a parameter of the algorithm. For example: longest common subsequence of the strings ABCDAB and ABCADB is of length 5 (ABCDB), while LCSk++ of these two strings with k=3 is 3 (ABC). This restriction loses some matches, but allows for a faster computation.

NOTE: The algorithms for the problem have advanced since this project. It is recommended to visit https://github.com/google/fast-simple-lcsk/ to obtain a better implementation.

Implementation

lcskpp.h/lcskpp.cpp

Implementation of several algorithms for computing LCSk++. The approach described in Section 3.2 of [1] can be found in the lcskpp_sparse_fast function.
fenwick.h

Implementation of the Fenwick tree data structure used by the lcskpp_sparse_fast.
test_lcskpp.cpp

A unit test for the algorithm.
random_strings.h

Functions for generating random strings as described in Section 4.1 of [1].

Dependencies

For compiling this library, it is necessary to have C++11 compatible compiler.

References

[1] Filip Pavetic, Goran Zuzic, Mile Sikic: LCSk++: Practical similarity metric for long strings, http://arxiv.org/abs/1407.2407
[2] Gary Benson, Avivit Levy, Riva Shalom: Longest Common Subsequence in k-length substrings, http://arxiv.org/abs/1402.2097
[3] Sebastian Deorowicz, Szymon Grabowski: Efficient algorithms for the longest common subsequence in k-length substrings, http://arxiv.org/abs/1311.4552

NOTE: [1] has been created as a continuation of the first authors Master Thesis, written on the Faculty of Electrical Engineering and Computing, University of Zagreb, Croatia.

fpavetic/lcskpp

LCSk++: Practical similarity metric for long strings

Implementation

Dependencies

References