CT-363
Design & Analysis of Algorithms
String Matching
String Matching Problem
• Text-editing programs frequently need to find all
occurrences of a pattern in the text.
• Typically, the text is a document being edited, and
the pattern searched for is a particular word
supplied by the user.
• Efficient algorithms for this problem, called
“string matching” algorithms, can greatly aid the
responsiveness of the text-editing program.
String Matching Problem
• In computer science, string-searching algorithms
(sometimes called string-matching algorithms)
try to find the places where one or several strings
(called patterns) occur within a larger string or
text.
Applications: String Matching Algorithms
• Finding particular patterns in DNA sequences
• Internet search engines
String Matching Problem
• We assume that the text is an array T [1 .. n] of
length n and that the pattern is an array P[1 .. m]
of length m ≤ n.
• We further assume that the elements of P and T
are characters drawn from a finite alphabet Σ.
– For example, we may have Σ = {0, 1} or
Σ = {a, b, . . . , z}.
• The character arrays P and T are often called
strings of characters.
String Matching Problem
• We say that pattern P occurs with shift s in text T (or,
equivalently, that pattern P occurs beginning at position
s + 1 in text T) if
0 ≤ s ≤ n - m and T[s + 1 .. s + m] = P[1 .. m]
(i.e., T[s + j] = P[j] for 1 ≤ j ≤ m).
– If P occurs with shift s in T, we call s a valid shift;
– otherwise, we call s an invalid shift.
String Matching Problem
• The string-matching problem is “finding all valid shifts
with which a given pattern P occurs in a given text T”.
Example: String Matching Problem
Text T:     a b c a b a a b c a b a c
Pattern P:        a b a a
Valid shift: s = 3
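The definition of a valid shift can be checked directly. The following is a minimal Python sketch, assuming 0-based Python strings stand in for the 1-based arrays T[1 .. n] and P[1 .. m]; the helper name is_valid_shift is hypothetical and not part of the slides.

```python
def is_valid_shift(text: str, pattern: str, s: int) -> bool:
    """Return True if pattern occurs in text with shift s.

    Python strings are 0-based, so T[s + 1 .. s + m] in the slides'
    1-based notation corresponds to text[s : s + m] here.
    """
    n, m = len(text), len(pattern)
    return 0 <= s <= n - m and text[s:s + m] == pattern


T = "abcabaabcabac"
P = "abaa"
# Test every candidate shift 0 .. n - m and keep the valid ones.
print([s for s in range(len(T) - len(P) + 1) if is_valid_shift(T, P, s)])  # [3]
```

For the example text and pattern, only s = 3 passes the test.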
Definitions and Notations
Notation   Terminology
Σ*         The set of all finite-length strings formed using characters from the alphabet Σ.
ε          The zero-length empty string, which also belongs to Σ*.
|x|        The length of a string x.
xy         The concatenation of two strings x and y; it has length |x| + |y| and consists of the characters from x followed by the characters from y.
w ⊏ x      A string w is a prefix of a string x if x = wy for some string y ∈ Σ*. If w ⊏ x, then |w| ≤ |x|.
w ⊐ x      A string w is a suffix of a string x if x = yw for some string y ∈ Σ*. If w ⊐ x, then |w| ≤ |x|.
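The prefix and suffix relations can be illustrated with Python's built-in string methods. This is a small sketch; the function names is_prefix and is_suffix are hypothetical, chosen only for illustration.

```python
def is_prefix(w: str, x: str) -> bool:
    """w is a prefix of x if x = w + y for some string y."""
    return x.startswith(w)


def is_suffix(w: str, x: str) -> bool:
    """w is a suffix of x if x = y + w for some string y."""
    return x.endswith(w)


print(is_prefix("ab", "abcca"))    # True
print(is_suffix("cca", "abcca"))   # True
# The empty string is both a prefix and a suffix of every string.
print(is_prefix("", "abc"), is_suffix("", "abc"))  # True True
# Concatenation: |xy| = |x| + |y|
print(len("ab" + "cca") == len("ab") + len("cca"))  # True
```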
1. Naive Approach
• The idea is based on Brute Force Approach.
• The naive algorithm finds all valid shifts using a loop that
checks the condition P[1 .. m] = T[s + 1 .. s + m] for each
of the n - m + 1 possible values of s.
• It can be interpreted graphically as sliding a
“template” containing the pattern over the text, noting for
which shifts all of the characters on the template equal
the corresponding characters in the text.
String Matching Algorithms
• There are many types of string-matching algorithms:
– The naive string-matching algorithm
– The Rabin-Karp algorithm
– String matching with finite automata
– The Knuth-Morris-Pratt algorithm
• We will discuss only two of them: the naive algorithm and
Rabin-Karp.
Naive String Matching Algorithm
• The naive algorithm finds all valid shifts using a loop
that checks the condition P[1 .. m] = T[s + 1 .. s + m] for each
of the n - m + 1 possible values of s.
NAIVE-STRING-MATCHER(T, P)
  n ← length[T]
  m ← length[P]
  for s ← 0 to n - m
      do if P[1 .. m] = T[s + 1 .. s + m]
            then print "Pattern occurs with shift" s
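The pseudocode above can be sketched in Python as follows. This is one possible translation, assuming 0-based strings and returning the list of valid shifts instead of printing them; the function name naive_string_matcher is chosen here for illustration.

```python
def naive_string_matcher(text: str, pattern: str) -> list[int]:
    """Return every shift s at which pattern occurs in text."""
    n, m = len(text), len(pattern)
    shifts = []
    # Try each of the n - m + 1 candidate shifts s = 0 .. n - m.
    for s in range(n - m + 1):
        # text[s:s + m] plays the role of T[s + 1 .. s + m].
        if text[s:s + m] == pattern:
            shifts.append(s)
    return shifts


print(naive_string_matcher("acaabc", "aab"))  # [2]
```

If the pattern is longer than the text, the range is empty and no shifts are reported.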
Example: Naive String Matching Algorithm
• Suppose
P = aab
T = acaabc
Find all valid shifts
Example: Naive String Matching Algorithm
n ← length[T] = 6
m ← length[P] = 3
for s ← 0 to n - m (6 - 3 = 3)
s = 0
T: a c a a b c
P: a a b
P[1] = T[s + 1]: P[1] = T[1] (as a = a)
P[2] = T[s + 2]: but P[2] ≠ T[2] (as a ≠ c)
So s = 0 is an invalid shift.
Example: Naive String Matching Algorithm
for s ← 1
T: a c a a b c
P:   a a b
P[1] = T[s + 1]: but P[1] ≠ T[2] (as a ≠ c)
So s = 1 is an invalid shift.
Example: Naive String Matching Algorithm
for s ← 2
T: a c a a b c
P:     a a b
P[1] = T[s + 1]: P[1] = T[3] (as a = a)
P[2] = T[s + 2]: P[2] = T[4] (as a = a)
P[3] = T[s + 3]: P[3] = T[5] (as b = b)
So s = 2 is a valid shift.
Example: Naive String Matching Algorithm
for s ← 3
T: a c a a b c
P:       a a b
P[1] = T[s + 1]: P[1] = T[4] (as a = a)
P[2] = T[s + 2]: but P[2] ≠ T[5] (as a ≠ b)
So s = 3 is an invalid shift; the only valid shift is s = 2.
Naive String Matching Algorithm
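• Worst-case running time
– Outer loop: n - m + 1 iterations
– Inner loop: up to m character comparisons per iteration
– Total: Θ((n - m + 1)m)
• Best case: Θ(n - m + 1), when the first character comparison fails at every shift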
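The two extremes can be observed by counting character comparisons. The sketch below is an illustration only; count_comparisons is a hypothetical helper that compares character by character and stops at the first mismatch, which is one common way to implement the test P[1 .. m] = T[s + 1 .. s + m].

```python
def count_comparisons(text: str, pattern: str) -> int:
    """Count character comparisons made by the naive matcher."""
    n, m = len(text), len(pattern)
    comparisons = 0
    for s in range(n - m + 1):
        for j in range(m):
            comparisons += 1
            if text[s + j] != pattern[j]:
                break  # mismatch: abandon this shift early
    return comparisons


n, m = 10, 3
# Many comparisons: every shift matches, so all m characters are compared.
print(count_comparisons("a" * n, "a" * m))  # 24 = (n - m + 1) * m
# Few comparisons: the first character mismatches at every shift.
print(count_comparisons("a" * n, "b" * m))  # 8 = n - m + 1
```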
Note
• It is not an optimal procedure for the string-matching problem.
• It has a high running time in the worst case.
• The naive string-matcher is inefficient because
information gained about the text for one value of s is
entirely ignored in considering other values of s.
2. The Rabin-Karp Algorithm
• It compares the hash values of strings rather than the strings
themselves.
• It performs well in practice and generalizes to other
algorithms for related problems, such as two-dimensional
pattern matching.
2. The Rabin-Karp Algorithm
Special Case
• Given a text T [1 .. n] of length n, a pattern P[1 .. m] of
length m ≤ n, both as arrays.
• Assume that the elements of P and T are characters drawn
from a finite alphabet Σ.
• Here Σ = {0, 1, 2, . . . , 9}, so that each character is a
decimal digit.
• Now our objective is “finding all valid shifts with which
a given pattern P occurs in a text T”.
Notations: The Rabin-Karp Algorithm
Let us suppose that
• p denotes the decimal value of the given pattern P[1 .. m]
• t_s denotes the decimal value of the length-m substring T[s + 1 .. s + m]
of the given text T[1 .. n], for s = 0, 1, ..., n - m.
• Clearly, t_s = p if and only if
T[s + 1 .. s + m] = P[1 .. m];
thus, s is a valid shift if and only if t_s = p.
• The question now is how to compute p and t_s efficiently.
• The answer is Horner’s rule.
Horner’s Rule
Example: Horner’s rule
[3, 4, 5] = 5 + 10(4 + 10(3)) = 5 + 10(4 + 30) = 5 + 340 = 345
p = P[3] + 10 (P[3 - 1] + 10(P[1])).
Formula
• We can compute p in time Θ(m) using this rule as
p = P[m] + 10(P[m-1] + 10(P[m-2] + … + 10(P[2] + 10 P[1]) … ))
• Similarly, t_0 can be computed from T[1 .. m] in time Θ(m).
• To compute t_1, t_2, . . . , t_{n-m} in time Θ(n - m), it suffices to
observe that t_{s+1} can be computed from t_s in constant time.
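Horner’s rule, as used above to compute p and the value of the first window, can be sketched in a few lines of Python. The function name horner_value and the base parameter are assumptions made for illustration; the slides fix the base at 10.

```python
def horner_value(digits: list[int], base: int = 10) -> int:
    """Evaluate a digit sequence (most significant digit first) in Θ(m) time."""
    value = 0
    for digit in digits:
        # Shift the accumulated value one position left, then add the next digit.
        value = base * value + digit
    return value


print(horner_value([3, 4, 5]))        # 345
print(horner_value([3, 1, 4, 1, 5]))  # 31415
```

Each digit costs one multiplication and one addition, which is where the Θ(m) bound comes from.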
Computing t_{s+1} from t_s in constant time
• Text = [3, 1, 4, 1, 5, 2]; t_0 = 31415
• m = 5; shift s = 0
• Old high-order digit = T[1] = 3
• New low-order digit = T[6] = 2
• General rule: t_{s+1} = 10(t_s - 10^{m-1} T[s + 1]) + T[s + m + 1]
• t_1 = 10(t_0 - 10^4 T[1]) + T[0 + 5 + 1]
= 10(31415 - 10^4 · 3) + 2
= 10(1415) + 2 = 14152
• Now t_1, t_2, . . . , t_{n-m} can be computed in Θ(n - m)
Procedure: Computing t_{s+1} from t_s
1. Subtract 10^{m-1} T[s + 1] from t_s; this removes the high-order digit.
2. Multiply the result by 10; this shifts the number left one position.
3. Add T[s + m + 1]; this brings in the appropriate low-order digit.
t_{s+1} = 10(t_s - 10^{m-1} T[s + 1]) + T[s + m + 1]
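The three steps of the procedure can be written as a single constant-time update. This is a minimal sketch without any modulus; next_window_value is a hypothetical name, and the list is 0-based, so T[s + 1] becomes text[s] and T[s + m + 1] becomes text[s + m].

```python
def next_window_value(t_s: int, old_digit: int, new_digit: int,
                      m: int, base: int = 10) -> int:
    """Slide a length-m window one position to the right."""
    # 1. remove the high-order digit, 2. shift left, 3. add the new low-order digit
    return base * (t_s - old_digit * base ** (m - 1)) + new_digit


text = [3, 1, 4, 1, 5, 2]
m = 5
t0 = 31415
t1 = next_window_value(t0, text[0], text[m], m)
print(t1)  # 14152
```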
Another issue and its treatment
• The only difficulty with the above procedure is that p and
t_s may be too large to work with conveniently.
• Fortunately, there is a simple cure for this problem:
compute p and the t_s values modulo a suitable modulus q.
Computing t_{s+1} from t_s Modulo q = 13
Text: 2 3 5 9 0 2 3 1 4 1 5 2 6 7 3 9 9 2 1
• The window of length 5 at positions 7 to 11 is 3 1 4 1 5.
• The numerical value of the window is 31415.
• 31415 mod 13 = 7
Spurious Hits and their Elimination
• m = 5
• p = 31415
• 31415 ≡ 7 (mod 13)
• 67399 ≡ 7 (mod 13)
• Window beginning at position 7: valid match (s = 6)
• Window beginning at position 13: spurious hit (s = 12)
• When the values mod 13 are equal, a character-by-character
text comparison is still needed to rule out spurious hits
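Position:    1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19
Text T:      2  3  5  9  0  2  3  1  4  1  5  2  6  7  3  9  9  2  1
Window value mod 13 (for the window starting at each of positions 1 to 15):
             8  9  3 11  0  1  7  8  4  5 10 11  7  9 11
Valid match: window starting at position 7
Spurious hit: window starting at position 13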
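The values in this example can be reproduced with a short Python sketch. It assumes radix 10 and q = 13 as in the example; window_hashes is a hypothetical helper name, and shifts are 0-based, so the window starting at position 7 has shift s = 6.

```python
def window_hashes(digits: list[int], m: int, q: int, d: int = 10) -> list[int]:
    """Return the value mod q of every length-m window, using the rolling update."""
    h = pow(d, m - 1, q)  # value of a 1 in the high-order position, mod q
    t = 0
    for i in range(m):    # Horner's rule for the first window
        t = (d * t + digits[i]) % q
    hashes = [t]
    for s in range(len(digits) - m):
        # Drop digits[s], shift left, bring in digits[s + m].
        t = (d * (t - digits[s] * h) + digits[s + m]) % q
        hashes.append(t)
    return hashes


text = [int(c) for c in "2359023141526739921"]
pattern = [3, 1, 4, 1, 5]
m, q = len(pattern), 13
p = window_hashes(pattern, m, q)[0]
hashes = window_hashes(text, m, q)
hits = [s for s, t in enumerate(hashes) if t == p]
valid = [s for s in hits if text[s:s + m] == pattern]
spurious = [s for s in hits if s not in valid]
print(hashes)           # [8, 9, 3, 11, 0, 1, 7, 8, 4, 5, 10, 11, 7, 9, 11]
print(valid, spurious)  # [6] [12]
```

Both shifts 6 and 12 have value 7 mod 13, but only shift 6 survives the explicit comparison with the pattern.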
2. The Rabin-Karp Algorithm
Generalization
• Given a text T [1 .. n] of length n, a pattern P[1 .. m] of
length m ≤ n, both as arrays.
• Assume that the elements of P and T are characters drawn
from a finite alphabet Σ = {0, 1, 2, . . . , d-1}, so that each
character is a digit in radix d.
• Now our objective is “finding all valid shifts with which
a given pattern P occurs in a text T”.
Note
• t_{s+1} = (d(t_s - T[s + 1]h) + T[s + m + 1]) mod q
where h ≡ d^{m-1} (mod q) is the value of the digit “1” in the
high-order position of an m-digit text window.
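As a worked instance of this update, the sketch below applies it to the decimal window 31415 followed by the digit 2, with d = 10 and q = 13. The variable names are chosen for illustration only.

```python
d, q, m = 10, 13, 5
h = pow(d, m - 1, q)             # 10^4 mod 13
t_s = 31415 % q                  # value of the current window mod q
old_digit, new_digit = 3, 2      # digit leaving the window, digit entering it

# t_{s+1} = (d(t_s - T[s + 1]h) + T[s + m + 1]) mod q
t_next = (d * (t_s - old_digit * h) + new_digit) % q
print(h, t_s, t_next)  # 3 7 8
print(14152 % q)       # 8, the same value computed directly from the new window
```

The intermediate value d(t_s - T[s + 1]h) can be negative (here it is -20). Python's % operator returns a non-negative result for a positive modulus, so no adjustment is needed; in languages where the remainder can be negative, q would have to be added back before comparing.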
Sequence of Steps Designing Algorithm
1. Compute the lengths of pattern P and text T.
2. Compute p and t_0 modulo q using Horner’s rule.
3. Any shift s for which t_s ≡ p (mod q) must be tested
further to see whether s is really a valid shift or a spurious hit.
4. This testing is done by checking the condition
P[1 .. m] = T[s + 1 .. s + m]. If these strings are equal, s
is a valid shift; otherwise it is a spurious hit.
5. Whether or not t_s ≡ p (mod q) held for shift s, compute t_{s+1}
from t_s and repeat step 3 for the next shift, as long as s < n - m.
Note
• Since t_s ≡ p (mod q) does not imply that t_s = p, a text
comparison is required to confirm a valid shift.
2. The Rabin-Karp Algorithm
RABIN-KARP-MATCHER(T, P, d, q)
  n ← length[T]
  m ← length[P]
  h ← d^{m-1} mod q
  p ← 0
  t_0 ← 0
  for i ← 1 to m                              // Preprocessing
      do p ← (d·p + P[i]) mod q
         t_0 ← (d·t_0 + T[i]) mod q
  for s ← 0 to n - m                          // Matching
      do if p = t_s
            then if P[1 .. m] = T[s + 1 .. s + m]
                    then print "Pattern occurs with shift" s
         if s < n - m
            then t_{s+1} ← (d(t_s - T[s + 1]h) + T[s + m + 1]) mod q
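The pseudocode above can be sketched in Python as follows. This is one possible translation under stated assumptions: characters are mapped to numbers with ord(), the defaults d = 256 and q = 101 are illustrative choices that do not come from the slides, shifts are returned as a list instead of printed, and an empty pattern or a pattern longer than the text yields no shifts.

```python
def rabin_karp_matcher(text: str, pattern: str, d: int = 256, q: int = 101) -> list[int]:
    """Return every shift s at which pattern occurs in text."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:   # assumption: the slides require 1 <= m <= n
        return []
    h = pow(d, m - 1, q)  # d^(m-1) mod q
    p = 0
    t = 0
    for i in range(m):    # preprocessing with Horner's rule
        p = (d * p + ord(pattern[i])) % q
        t = (d * t + ord(text[i])) % q
    shifts = []
    for s in range(n - m + 1):  # matching
        # Equal hash values are only a candidate; verify to rule out spurious hits.
        if p == t and text[s:s + m] == pattern:
            shifts.append(s)
        if s < n - m:
            # Rolling update: drop text[s], bring in text[s + m].
            t = (d * (t - ord(text[s]) * h) + ord(text[s + m])) % q
    return shifts


print(rabin_karp_matcher("2359023141526739921", "31415", d=10, q=13))  # [6]
print(rabin_karp_matcher("acaabc", "aab"))                             # [2]
```

Because every hash hit is verified against the text, the result does not depend on the choice of d and q; a poor choice only increases the number of spurious hits.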
Analysis: The Rabin-Karp Algorithm
• Worst-case running time
– Preprocessing time: Θ(m)
– Matching time: Θ((n - m + 1)m)
• If P = a^m and T = a^n, verification takes time Θ((n - m + 1)m),
since each of the n - m + 1 possible shifts is valid.
• In applications with few valid shifts (some constant number c of them),
the expected matching time of the algorithm is only
O((n - m + 1) + cm) = O(n + m), plus
the time required to process spurious hits.
Summary
4. The Knuth-Morris-Pratt algorithm
Algorithm          Preprocessing Time   Matching Time
Naive              0                    O((n-m+1)m)
Rabin-Karp         Θ(m)                 O((n-m+1)m)
Finite Automaton   O(m|Σ|)              Θ(n)