KMP Algorithm
What is KMP?
The Knuth-Morris-Pratt (KMP) algorithm is a string matching algorithm designed for searching substrings or patterns within a main text string efficiently.
It was developed by Donald Knuth, Vaughan Pratt, and James H. Morris in 1977 (where thses three names appears on the algorithm name). The key advantage of the KMP algorithm over naive substring search methods is its linear time complexity, which makes it much more efficient, especially when dealing with large texts or patterns.
Efficiency and Use Cases
The overall time complexity of the KMP algorithm is 𝑂(𝑛+𝑚), which is optimal for the problem of string matching. Because it eliminates the possibility of redundant checks found in naive string matching algorithms, KMP is particularly useful in applications where repetitive and large-scale text scanning is required, such as:
- Text editing software for find and replace functions.
- Search engines for matching keywords in texts or documents.
- Data analysis tools that scan large datasets for patterns or anomalies.
KMP is a foundational technique in the field of computer science, particularly within the realms of text processing and data retrieval. Its development marked a significant advancement in the efficiency of pattern searching algorithms, making operations that involve string matching significantly faster and more scalable.
How KMP Works
- Preprocessing Phase (Building the LPS Array):
- KMP begins by preprocessing the pattern to create the Longest Prefix which is also Suffix (LPS) array. The LPS array stores the lengths of the longest proper prefix of the pattern which is also a suffix for each position within the pattern. This preprocessing is what allows the algorithm to skip unnecessary comparisons.
- The LPS array preparation has a time complexity of 𝑂(𝑚), where 𝑚m is the length of the pattern.
- Searching Phase:
- The searching phase uses the LPS array to avoid re-checking previously matched characters. When a mismatch occurs after some matches, the pattern is shifted according to the last “good” position recorded in the LPS array. This way, the algorithm ensures that it leverages the match information already gained and reduces the number of comparisons.
- The searching phase has a time complexity of 𝑂(𝑛), where 𝑛n is the length of the text.
How to build the LPS Array
For example, we have a pattern aabaaf
.
What is Prefix
A prefix refers to any contiguous leading subsequence of a pattern.
More specifically, it can be any initial part of the pattern, starting from the first character and extending up to any point within the pattern, but not including the entire pattern itself. This is to distinguish it from a “proper prefix,” which explicitly excludes the entire sequence itself as a prefix.
Example prefix for pattern aabaaf
:
a
aa
aab
aaba
aabaa
What is Suffix
A suffix refers to any contiguous trailing subsequence of a pattern. A suffix is essentially any ending part of the pattern, starting from any point within the pattern and extending to the last character.
Example suffix for pattern aabaaf
:
f
af
aaf
baaf
abaaf
aabaaf
Each suffix includes all consecutive characters up to the end of the string from a specific starting point within the string.
Build the LPS Array using Prefix and Suffix
We use pattern aabaaf
as example:
For each index we iterate, we treat them as a substring and find the longest match part from prefix and suffix of this substring as the figure above shown. The number we put in this index of the LPS array is the length of this matched part, which also can be think as the index where we last find this substring pair.
For example when i
is 4, we have longest prefix-suffix match aa
, and put 2 in LPS[4]
. This 2
means if we failed to pair the character in next string, we don’t need to re-match the string at the start. We could just start after 2nd
character since we already validate aa
when we did on the index 4
.
As you can see, when we failed to match character, we find the index to re-match at the j - 1
index, and re-match character at updated j + 1
index.
To make process simpler, LPS array usually shift right by one index so that each time we failed to match character, we just look up the value at that index j
instead of j - 1
. Thus, [0, 1, 0, 1, 2, 0]
becomes [x, 0, 1, 0, 1, 2]
But, 2
at here means the second character, which is at index 1
, each time we have to calculate index by decrese 1 number. Some people also simply this by decrease all values in LPS array by 1 which will looks like the following:
Both three versions can be used when you implementing KMP Algorithm, but use proper calculations when dealing each version.