searches for occurrences of a "pattern" P within a main "text string" S by employing the observation that when a mismatch occurs, the word itself embodies sufficient information to determine where the next match could begin, thus bypassing re-examination of previously matched characters.
read more
The length of the longest proper prefix in the (sub)pattern that matches a proper suffix in the same (sub)pattern.
example:
"abababca":
char: | a | b | a | b | a | b | c | a |
---|---|---|---|---|---|---|---|---|
index: | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
value: | 0 | 0 | 1 | 2 | 3 | 4 | 0 | 1 |
-
third cell: first three characters ("aba")
- prefixes ("a" and "ab")
- suffixes ("a" and "ba")
- the proper prefix "a" matches the proper suffix "a" --> value=1
-
cell 4: ("abab")
- prefixes ("a","ab","aba")
- suffixes ("b","ab","bab")
- matches = "ab" --> value=2
-
cell 5: ("ababa")
- prefixes ("a","ab","aba","abab")
- suffixes ("a","ba","aba","baba")
- matches = "aba" --> value=3
-
cell 7: ("abababc")
- there aren’t going to be any matches; all the suffixes will end with the letter “c”, and none of the prefixes will
def kmp_table(s):
"""
Partial match table
O(K): K=length(s)
"""
length = len(s)
partial_match_table = [0]*length
s = list(s)
i,j=0,1
while i<length and j<length:
if s[i]==s[j]: # case 1: the substring continues
partial_match_table[j]=i+1
i+=1
j+=1
else: # case 2: the substring doesn't continues, but can fall back
if i==0:
partial_match_table[j]=0
j+=1
else: # case 3: run out of candates
i = partial_match_table[i-1]
return partial_match_table
s="ababab"
print(kmp_table(s))
# [0, 0, 1, 2, 3, 4]
We can use the values in the partial match table to skip ahead (rather than redoing unnecessary old comparisons) when we find partial matches.
If a partial match of length partial_match_length
is found and table[partial_match_length] > 1
, we may skip ahead partial_match_length - table[partial_match_length - 1]
characters.
example:
matching the pattern "abababca" against the text "bacbababaabcbab"
b | a | c | b | a | b | a | b | a | a | b | c | b | a | b |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
l | ||||||||||||||
a | b | a | b | a | b | c | a |
- partial_match_length=1
- table[partial_match_length - 1]=table[0]=0
- don't skip
b | a | c | b | a | b | a | b | a | a | b | c | b | a | b |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
l | l | l | l | l | ||||||||||
a | b | a | b | a | b | c | a |
- partial_match_length=5
- table[partial_match_length - 1]=table[4]=3
- skip=partial_match_length-table[4]=5-3=1
b | a | c | b | a | b | a | b | a | a | b | c | b | a | b |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
x | x | l | l | l | ||||||||||
a | b | a | b | a | b | c | a |
- partial_match_length=3
- table[partial_match_length - 1]=table[2]=1
- skip=partial_match_length-table[2]=3-1=2
b | a | c | b | a | b | a | b | a | a | b | c | b | a | b | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
x | x | l | |||||||||||||
a | b | a | b | a | b | c | a |
- pattern is longer than the remaining characters in the text, so we know there’s no match.
def kmp_search(P, S, kmp_table):
"""
O(n)
P: the pattern(word) sought
S: the text to be searched
return: an integer (the zero-based position in S at which P is found)
"""
m = 0 # the beginning of the current match in S
i = 0 # the position of the current character in P
while m + i < len(S):
if P[i] == S[m+i]:
if i == len(P) - 1:
return m
i += 1
else:
if kmp_table[i] > 0:
m = m + i - kmp_table[i]
i = kmp_table[i]
else:
m += 1
i = 0
return None # have searched all of S unsuccessfully
s="ababab"
print(kmp_search('aba', s, kmp_table(s)))
# 0