String matching algorithms
Deliverables
String Basics
Naïve String matching Algorithm
Boyer Moore Algorithm
Rabin-Karp Algorithm
Knuth-Morris- Pratt Algorithm
11/15/2021 10:08 AM Copyright @ gdeepak.Com® 2
String Basics
A string is a sequence of characters
Examples of strings: C++ program, HTML
document, DNA sequence, Digitized image
An alphabet S is the set of possible characters for a
family of strings
Example of alphabets: ASCII (used by C and C++),
Unicode (used by Java), {0, 1}, {A, C, G, T}
11/15/2021 10:08 AM Copyright @ gdeepak.Com® 3
String Basics
Let P be a string of size m
A substring P[i .. j] of P is the subsequence of P consisting of
the characters with ranks between i and j
A prefix of P is a substring of the type P[0 .. i]
A suffix of P is a substring of the type P[i ..m - 1]
Given strings T (text) and P (pattern), pattern matching
problem consists of finding a substring of T equal to P
Applications: Text editors, Search engines, Biological
research
11/15/2021 10:08 AM Copyright @ gdeepak.Com® 4
Brute Force String Matching
Naive-String-Matcher(T, P)
1. n ← length[T]
2. m ← length[P]
3. for s ← 0 to n - m
4. do if P[1 .. m]=T[s+1..s+m]
5. then print "Pattern occurs
with shift" s Match Q with A, not matching, so
Worst case O(m*n) shift the pattern by one and so on.
T = aaa … ah
P = aaah
11/15/2021 10:08 AM Copyright @ gdeepak.Com® 5
Boyer-Moore Algorithm
It uses two heuristics
Looking-glass heuristic: Compare P
with T moving backwards
Character-jump heuristic: When a
mismatch occurs at T[i] = c
If P contains c, shift P to align the last
occurrence of c in P with T[i]
Else, shift P to align P[0] with T[i + 1]
11/15/2021 10:08 AM Copyright @ gdeepak.Com® 6
Boyer Moore Algorithm
Character shift
M 0
• Boyer-Moore’s runs in time O(nm + s) H 1
T 2
• Example of worst case: T = aaa …a P = baaa I
R
3
4
• Worst case may occur in images and DNA sequences but
unlikely in English text so BM is better for English Text
a p a t t e r n m a t c h i n g a l g o r i t h m
1 3 5 11 10 9 8 7
r i t h m r i t h m r i t h m r i t h m
2 4 6
r i t h m r i t h m r i t h m
11/15/2021 10:08 AM Copyright @ gdeepak.Com® 7
Boyer Moore Algorithm
Algorithm BoyerMooreMatch(T, P, S) else
L lastOccurenceFunction(P, S ) { character-jump }
im-1 l L[T[i]]
jm-1 i i + m – min(j, 1 + l)
repeat jm-1
if T[i] = P[j] until i > n - 1
if j = 0 return -1 { no match }
return i { match at i }
else
ii-1
jj-1
11/15/2021 10:08 AM Copyright @ gdeepak.Com® 8
Example
• Text: Does the text match the pattern?
• Patter: tern
11/15/2021 10:08 AM Copyright @ gdeepak.Com® 9
Boyer Moore Algorithm
• Bad Character
– Upon mismatch, skips alignments until
• mismatch becomes a match or
• p moves past mismatched character
• Good Suffix Rule
– Substring matched by inner loop; skip until
• There are no mismatch between p and substring
• p moves past substring.
Ref: https://www.youtube.com/watch?v=4Xyhb72LCX4
11/15/2021 10:08 AM Copyright @ gdeepak.Com® 10
Rabin-Karp Algorithm
• It speeds up testing of equality of the pattern to the
substrings in the text by using a hash function.
• A hash function is a function which converts every string
into a numeric value, called its hash value; e.g.
hash("hello")=5. It exploits the fact that if two strings are
equal, their hash values are also equal.
• There are two problems: Different strings can also result in
the same hash value and there is extra cost of calculating
the hash for each group of strings.
• Worst case is O(mn)
11/15/2021 10:08 AM Copyright @ gdeepak.Com® 11
How Rabin-Karp works
• Let characters in both arrays T and P be digits in radix-S
notation. (S = (0,1,...,9)
• Let p be the value of the characters in P
• Choose a prime number q such that fits within a computer
word to speed computations.
• Compute (p mod q)
– The value of p mod q is what we will be using to find all matches
of the pattern P in T.
How Rabin-Karp works (continued)
• Compute (T[s+1, .., s+m] mod q) for s = 0 .. n-m
• Test against P only those sequences in T having the same
(mod q) value
• (T[s+1, .., s+m] mod q) can be incrementally computed
by subtracting the high-order digit, shifting, adding the
low-order bit, all in modulo q arithmetic.
A Rabin-Karp example
• Given T = 31415926535 and P = 26
• We choose q = 11
• P mod q = 26 mod 11 = 4
3 1 4 1 5 9 2 6 5 3 5
31 mod 11 = 9 not equal to 4
3 1 4 1 5 9 2 6 5 3 5
14 mod 11 = 3 not equal to 4
3 1 4 1 5 9 2 6 5 3 5
41 mod 11 = 8 not equal to 4
Rabin-Karp example continued
3 1 4 1 5 9 2 6 5 3 5
15 mod 11 = 4 equal to 4 -> spurious hit
3 1 4 1 5 9 2 6 5 3 5
59 mod 11 = 4 equal to 4 -> spurious hit
3 1 4 1 5 9 2 6 5 3 5
92 mod 11 = 4 equal to 4 -> spurious hit
3 1 4 1 5 9 2 6 5 3 5
26 mod 11 = 4 equal to 4 -> an exact match!!
3 1 4 1 5 9 2 6 5 3 5
65 mod 11 = 10 not equal to 4
Rabin-Karp example continued
3 1 4 1 5 9 2 6 5 3 5
53 mod 11 = 9 not equal to 4
3 1 4 1 5 9 2 6 5 3 5
35 mod 11 = 2 not equal to 4
As we can see, when a match is found, further testing
is done to insure that a match has indeed been
found.
Rolling Hash
• A rolling hash allows an algorithm to calculate a hash
value without having the rehash the entire string.
• For example, when searching for a word in a text, as the
algorithm shifts one letter to the right instead of having
to calculate the hash of the section.
• Eg: Convert a text from a[8-9]=53 to a[9-10]=35.
a[9-10] = a[8-9]-a[8]+a[10]
a[9-10] = ((5*101 + 3*100) – (5*101)) *101+ (5*10o)
a[9-10] = (53-50)*101+ 5 = 35
11/15/2021 10:08 AM Copyright @ gdeepak.Com® 17
Rabin Karp Example
Here we are matching 31415 in the given string. We can have various valid hits
but there may be few valid matches.
11/15/2021 10:08 AM Copyright @ gdeepak.Com® 18
KMP Algorithm
• It never re-compares a text symbol that has matched a pattern
symbol. As a result, complexity of the searching phase of the
KMP = O(n). Preprocessing phase has a complexity of O(m).
Since m< n, the overall complexity of is O(n). A border of x is a
substring that is both proper prefix and proper suffix of x. We
call its length b the width of the border.
• Let x=abacab Proper prefixes of x are ε, a, ab, aba, abac, abaca
• The proper suffixes of x are ε, b, ab, cab, acab, bacab
• The borders of x are ε, ab
11/15/2021 10:08 AM Copyright @ gdeepak.Com® 19
KMP-Preprocess Algorithm
void kmpPreprocess(){ For pattern p = ababaa the
int i=0, j=-1; widths of the borders in
b[i]=j;
array b have the following
while (i<m) {
values.
while (j>=0 && p[i]!=p[j])
j=b[j]; For instance we
i++; have b[5] = 3, since the
j++; prefix ababa of length 5 has
b[i]=j; a border of width 3
}}
11/15/2021 10:08 AM Copyright @ gdeepak.Com® 23
Preprocessing
pre-processing algorithm
could be applied to the
string pt instead of p. If
borders up to a width of m are
computed only, then a border
of width m of some
prefix x of pt corresponds to a
match of the pattern
in t (provided that the border
is not self-overlapping)
11/15/2021 10:08 AM Copyright @ gdeepak.Com® 24
KMP Search Algorithm
void kmpSearch() {
int i=0, j=0;
while (i<n){
while (j>=0 && t[i]!=p[j])
j=b[j];
i++;
j++;
if (j==m) {
report(i-j);
j=b[j];
}}}
11/15/2021 10:08 AM Copyright @ gdeepak.Com® 25
KMP Search Algorithm
• When in inner while loop a mismatch at position j occurs, the
widest border of the matching prefix of length j of the pattern
is considered. Resuming comparisons at position b[j], the
width of the border, yields a shift of the pattern such that the
border matches. If again a mismatch occurs, the next-widest
border is considered, and so on, until there is no border left or
the next symbol matches. Then we have a new matching prefix
of the pattern and continue with the outer while loop.
• If all m symbols of the pattern have matched the
corresponding text window (j = m), a function report is called
for reporting the match at position i-j. Afterwards, the pattern
is shifted as far as its widest border allows
11/15/2021 10:08 AM Copyright @ gdeepak.Com® 26
a b a c a a b a c c a b a c a b a a b b
1 2 3 4 5 6
a b a c a b
7
a b a c a b
8 9 10 11 12
a b a c a b
13
x 0 1 2 3 4 5
a b a c a b
P[x] a b a c a b
14 15 16 17 18 19
f(x) 0 0 1 0 1 2
a b a c a b
11/15/2021 10:08 AM Copyright @ gdeepak.Com® 27
KMP Video
11/15/2021 10:08 AM Copyright @ gdeepak.Com® 28
Questions, comments and Suggestions
11/15/2021 10:08 AM Copyright @ gdeepak.Com® 29
Question 1
How many nonempty prefixes of the string
p=“aaabbaaa” are also suffixes of P?
11/15/2021 10:08 AM Copyright @ gdeepak.Com® 30
Question 2
What is the longest proper prefix of the string
cgtacgttcgtacg that is also the suffix of this string.
11/15/2021 10:08 AM Copyright @ gdeepak.Com® 31
Question 3
What is the complexity of the KMP Algorithm, if we have a
main string of length s and we wish to find the pattern of
length p.
A) O( s+p)
B) O(p)
C) O(sp)
D) O(s)
11/15/2021 10:08 AM Copyright @ gdeepak.Com® 32