/kmp

Knuth–Morris–Pratt algorithm

Primary LanguagePython

Knuth–Morris–Pratt algorithm

searches for occurrences of a "pattern" P within a main "text string" S by employing the observation that when a mismatch occurs, the word itself embodies sufficient information to determine where the next match could begin, thus bypassing re-examination of previously matched characters.
read more

Partial Match Table

The length of the longest proper prefix in the (sub)pattern that matches a proper suffix in the same (sub)pattern.

example:
"abababca":

char: a b a b a b c a
index: 0 1 2 3 4 5 6 7
value: 0 0 1 2 3 4 0 1
  • third cell: first three characters ("aba")

    • prefixes ("a" and "ab")
    • suffixes ("a" and "ba")
    • the proper prefix "a" matches the proper suffix "a" --> value=1
  • cell 4: ("abab")

    • prefixes ("a","ab","aba")
    • suffixes ("b","ab","bab")
    • matches = "ab" --> value=2
  • cell 5: ("ababa")

    • prefixes ("a","ab","aba","abab")
    • suffixes ("a","ba","aba","baba")
    • matches = "aba" --> value=3
  • cell 7: ("abababc")

    • there aren’t going to be any matches; all the suffixes will end with the letter “c”, and none of the prefixes will
def kmp_table(s):   
    """
    Partial match table
    O(K): K=length(s) 
    """
    length = len(s)
    partial_match_table = [0]*length
    s = list(s)
    i,j=0,1
    while i<length and j<length:
        if s[i]==s[j]:  # case 1: the substring continues  
            partial_match_table[j]=i+1
            i+=1
            j+=1
        else:  # case 2: the substring doesn't continues, but can fall back
            if i==0:
                partial_match_table[j]=0
                j+=1
            else:  # case 3: run out of candates
                i = partial_match_table[i-1]
    return partial_match_table

s="ababab"
print(kmp_table(s))
# [0, 0, 1, 2, 3, 4]

code visualization

Use Partial Match Table

We can use the values in the partial match table to skip ahead (rather than redoing unnecessary old comparisons) when we find partial matches.
If a partial match of length partial_match_length is found and table[partial_match_length] > 1, we may skip ahead partial_match_length - table[partial_match_length - 1] characters.

example:
matching the pattern "abababca" against the text "bacbababaabcbab"


b a c b a b a b a a b c b a b
l
a b a b a b c a
  • partial_match_length=1
  • table[partial_match_length - 1]=table[0]=0
  • don't skip

b a c b a b a b a a b c b a b
l l l l l
a b a b a b c a
  • partial_match_length=5
  • table[partial_match_length - 1]=table[4]=3
  • skip=partial_match_length-table[4]=5-3=1

b a c b a b a b a a b c b a b
x x l l l
a b a b a b c a
  • partial_match_length=3
  • table[partial_match_length - 1]=table[2]=1
  • skip=partial_match_length-table[2]=3-1=2

b a c b a b a b a a b c b a b
x x l
a b a b a b c a
  • pattern is longer than the remaining characters in the text, so we know there’s no match.
def kmp_search(P, S, kmp_table):
    """
    O(n)
    P: the pattern(word) sought
    S: the text to be searched
    return: an integer (the zero-based position in S at which P is found)

    """
    m = 0  # the beginning of the current match in S
    i = 0  # the position of the current character in P
    
    while m + i < len(S):
        if P[i] == S[m+i]:
            if i == len(P) - 1:
                return m
            i += 1
        else:
            if kmp_table[i] > 0:
                m = m + i - kmp_table[i]
                i = kmp_table[i]
            else:
                m += 1
                i = 0
    return None  # have searched all of S unsuccessfully
    
s="ababab"
print(kmp_search('aba', s, kmp_table(s)))
# 0 

code visualization