/SuperFastMatchOld

Finds the full set of longest common substrings in two input strings

Primary LanguagePython

Introduction

Needed a fast algorithm to find the full set of longest common substrings, along with their start and end position in the two input strings. Will be used in the Churnalisted project for finding cases of copy and paste from press release to news article and then highlighting the exact passages to the site’s visitor.

The hashing code section is completely ripped from:

http://code.google.com/p/py-rollingsuperfasthash/

which in turn derives from:

http://www.azillionmonkeys.com/qed/hash.html

The searching code section is inspired by:

http://en.wikipedia.org/wiki/Rabin-Karp_string_search_algorithm

and the filtering code section is inspired by, but not based on:

http://en.wikipedia.org/wiki/Interval_tree

Installation

Easiest:

pip install git+http://github.com/donovanhide/SuperFastMatch.git

or

git clone http://github.com/donovanhide/SuperFastMatch.git
cd SuperFastMatch
python setup.py install
python test.py

Using

import superfastmatch
s1 = "I'm a test with a section that will be repeated repeated testingly testingly"
s2 = "I will be repeated repeated testingly 123 testingly"
matches = superfastmatch.superfastmatch(s1,s2,6)

where 6 is the minimum substring length to search for and matches is a nested tuple with the following format:

(hash,start,end,length,[(hash,start,end,length),(hash,start,end,length),.....])

Limitations

The hashing code processes the input strings byte by byte. If they were to be unicode compatible, for the start and end points to be correct, then the C code would have to use UTF-32 internally and divide all positions by 4. This seemed like it might be slower and quite complicated. The simple hack is to encode all unicode strings as Latin-1 and map any >255 characters to a ‘?’ using the replace flag. It also makes sense to convert the inputs into all lower case. For example:

def test_short_unicode(self):
s1 = u"This is “Unicode”"
s2 = u"unicode I am"
matches = superfastmatch.superfastmatch(s1.lower().encode('latin-1','replace'),s2.lower().encode('latin-1','replace'),2)

Perhaps this conversion belongs in the code itself. Patches welcome!