This is an implementation of the LCSk++ metric for long strings described in [1].
LCSk++ of two strings a and b calculates the longest common subsequence of two strings with the restriction that the consecutive runs of indices in both strings have length at least k, which is a parameter of the algorithm. For example: longest common subsequence of the strings ABCDAB and ABCADB is of length 5 (ABCDB), while LCSk++ of these two strings with k=3 is 3 (ABC). This restriction loses some matches, but allows for a faster computation.
NOTE: The algorithms for the problem have advanced since this project. It is recommended to visit https://github.com/google/fast-simple-lcsk/ to obtain a better implementation.
- lcskpp.h/lcskpp.cpp
Implementation of several algorithms for computing LCSk++. The approach described in Section 3.2 of [1] can be found in the lcskpp_sparse_fast function.
- fenwick.h
Implementation of the Fenwick tree data structure used by the lcskpp_sparse_fast.
- test_lcskpp.cpp
A unit test for the algorithm.
- random_strings.h
Functions for generating random strings as described in Section 4.1 of [1].
For compiling this library, it is necessary to have C++11 compatible compiler.
[1] Filip Pavetic, Goran Zuzic, Mile Sikic: LCSk++: Practical similarity metric for long strings, http://arxiv.org/abs/1407.2407
[2] Gary Benson, Avivit Levy, Riva Shalom: Longest Common Subsequence in k-length substrings, http://arxiv.org/abs/1402.2097
[3] Sebastian Deorowicz, Szymon Grabowski: Efficient algorithms for the longest common subsequence in k-length substrings, http://arxiv.org/abs/1311.4552
NOTE: [1] has been created as a continuation of the first authors Master Thesis, written on the Faculty of Electrical Engineering and Computing, University of Zagreb, Croatia.